Working with the Disallow parameter in indexer.conf will limit what is indexed, but does not seem to allow passing through one page to index another.

I thought I would start slowly in configuring mnoGoSearch, but that doesn't seem to be working very well.  What I want to do is index the archived messages in Mailman (http://www.list.org/) mailing lists.  Because Mailman stores each message three times (individually, monthly, and in the entire raw archive), each hit would actually return three links.  So I figured that the way to go would be to use a filter restricted to the subdirectory level at which the individual messages are stored.  So as not to pick up any of the archive index files, the filter should only index files with a number as the file name (e.g. 000346.html).  The subdirectory level is:

        http://web-test2.mc.duke.edu/pipermail/listname/year-month/

I was hoping that I could use two filters (one for the public archives and another for the private archives) using the Realm Regex [alias] parameter to search through all of the various lists like:

        Realm Regex ^http://web-test2.mc.duke.edu/pipermail/.*/[0-9]\.html file:/mailman/archives/public/.*/[0-9]\.html

It would seem that the URL parameter only functions as a subset of the Realm parameter.  Since the URL parameter does not appear to have Regex capabilities, a single filter will probably not work.
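
The Realm Regex pattern above can be checked outside mnoGoSearch before it goes into indexer.conf. The following is only a rough sketch using Python's re module, which may not behave exactly like the regex engine mnoGoSearch uses. The directory name "2001-August" and the index file names are made-up examples, and MESSAGE_URL is a hypothetical variant, not a tested indexer.conf setting. It shows that "[0-9]" on its own matches a single digit, so a file name like 000346.html needs "[0-9]+".

```python
import re

# Hypothetical variant: the whole file name is one or more digits,
# dots are escaped, and the pattern is anchored at both ends.
MESSAGE_URL = re.compile(
    r"^http://web-test2\.mc\.duke\.edu/pipermail/[^/]+/[^/]+/[0-9]+\.html$"
)

# The pattern as written in the message: "[0-9]" matches exactly one digit.
ORIGINAL = re.compile(r"^http://web-test2.mc.duke.edu/pipermail/.*/[0-9]\.html")


def is_message_url(url: str) -> bool:
    """Return True if url looks like an individual archived message."""
    return MESSAGE_URL.match(url) is not None


# Made-up sample URLs for illustration only.
urls = [
    "http://web-test2.mc.duke.edu/pipermail/listname/2001-August/000346.html",
    "http://web-test2.mc.duke.edu/pipermail/listname/2001-August/thread.html",
    "http://web-test2.mc.duke.edu/pipermail/listname/index.html",
]

for u in urls:
    # Columns: hypothetical variant, original pattern, URL.
    print(is_message_url(u), ORIGINAL.match(u) is not None, u)
```

If this behaves the same way in mnoGoSearch, the original pattern would not match any of the numbered message files, whatever the URL parameter does.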

Does anyone have a solution to what I want to do?

-- James

     
     o o o o o o o . . .   _______________________ _______=======_T___
   o      _____            |James Madill         | |Duke Univ Med Ctr|
>.][__n_n_| D[  ====|____  |[EMAIL PROTECTED]| | (919) 286-6384  |
 (________|__|_[____/____]_|_____________________|_|_________________|
_/oo  O-O-O  `  oo     oo  'o^o^o           o^o^o` 'o^o           o^o`
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
<http://www.duke.edu/~madil001/>



"James Madill" <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]

08/07/01 14:16
Please respond to general; Please respond to "James Madill"

       
        To:        Alexander Barkov <[EMAIL PROTECTED]>
        cc:        [EMAIL PROTECTED]
        Subject:        Re: What is wrong with my indexer.conf?




OK.  Adding:

        URL http://arachnia.mc.duke.edu/madil001/

to the indexer.conf file with:

        Realm http://arachnia.mc.duke.edu/madil001/

or

        Realm http://arachnia.mc.duke.edu/madil00*

now works, but what I actually want to do is index only certain documents within that subdirectory structure, say files beginning with the letter "i".

        Realm http://arachnia.mc.duke.edu/madil001/i*

doesn't work with the URL parameter above.  I can't quite understand what the Realm parameter actually does.  Perhaps I should be using the Server and Allow parameters?
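
One thing that can be checked without running indexer is which URLs a wildcard pattern like the one above would accept. This is only a sketch: it uses Python's fnmatch as a stand-in, and mnoGoSearch's own wildcard matching may differ. The file names below are hypothetical. Note that the start URL itself does not match the "i*" pattern, which may or may not be related to the problem.

```python
from fnmatch import fnmatchcase

# The Realm pattern from the message above.
PATTERN = "http://arachnia.mc.duke.edu/madil001/i*"


def in_realm(url: str, pattern: str = PATTERN) -> bool:
    """Return True if url matches the shell-style wildcard pattern."""
    return fnmatchcase(url, pattern)


# Hypothetical URLs; only the base directory comes from the message.
candidates = [
    "http://arachnia.mc.duke.edu/madil001/",
    "http://arachnia.mc.duke.edu/madil001/index.html",
    "http://arachnia.mc.duke.edu/madil001/about.html",
]

for url in candidates:
    print(in_realm(url), url)
```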


-- James

   


Alexander Barkov <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]

08/07/01 04:41

       
       To:        [EMAIL PROTECTED], James Madill <[EMAIL PROTECTED]>

       cc:        

       Subject:        Re: What is wrong with my indexer.conf?




This is because the Realm command does not insert any
start URLs the way Server does. You have to add
start pages, either with the indexer.conf command

        URL http://arachnia.mc.duke.edu/madil001/

or from the command line with

        indexer -i -u http://arachnia.mc.duke.edu/madil001/
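
The difference between a filter and a start URL can be illustrated with a toy crawler. This is a simplified model written for this explanation, not mnoGoSearch's actual implementation. The link graph, the crawl() function, and the page a.html are all hypothetical, and fnmatch stands in for the Realm wildcard match.

```python
from collections import deque
from fnmatch import fnmatchcase

START = "http://arachnia.mc.duke.edu/madil001/"
REALM = "http://arachnia.mc.duke.edu/madil001/*"

# Hypothetical link graph: page URL -> URLs it links to.
LINKS = {
    START: [START + "a.html", "http://www.duke.edu/"],
    START + "a.html": [],
    "http://www.duke.edu/": [],
}


def crawl(start_urls, realm):
    """Visit pages reachable from start_urls, keeping only URLs matching realm."""
    queue = deque(start_urls)
    seen = []
    while queue:
        url = queue.popleft()
        # The realm only filters URLs; it never adds one to the queue.
        if url in seen or not fnmatchcase(url, realm):
            continue
        seen.append(url)
        queue.extend(LINKS.get(url, []))
    return seen


print(crawl([], REALM))       # a filter alone: nothing to start from
print(crawl([START], REALM))  # a filter plus a start URL
```

In this model a filter with no start URL leaves the queue empty, so nothing is ever fetched, which mirrors the behaviour described above.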



James Madill wrote:
>
> Can anyone tell me what is wrong with my indexer.conf file?
>
> I am running MnoGoSearch 3.1.17 on a Solaris 7 machine.  The install
> was made using the defaults, save for using a local database.
>  search.htm, spelld.conf, and indexer.conf are the distribution files
> save for the actual URLs to index.  The indexer.conf file works fine
> when using the Server parameter:
>
>         StopwordTable stopword
>
>         Disallow *.b    *.sh   *.md5  *.rpm
>         Disallow *.arj  *.tar  *.zip  *.tgz  *.gz   *.z     *.bz2
>         Disallow *.lha  *.lzh  *.rar  *.zoo  *.ha   *.tar.Z
>         Disallow *.gif  *.jpg  *.jpeg *.bmp  *.tiff *.tif   *.xpm  *.xbm *.pcx
>         Disallow *.vdo  *.mpeg *.mpe  *.mpg  *.avi  *.movie *.mov  *.dat
>         Disallow *.mid  *.mp3  *.rm   *.ram  *.wav  *.aiff  *.ra
>         Disallow *.vrml *.wrl  *.png
>         Disallow *.exe  *.com  *.cab  *.dll  *.bin  *.class *.ex_
>         Disallow *.tex  *.texi *.xls  *.doc  *.texinfo
>         Disallow *.rtf  *.pdf  *.cdf  *.ps
>         Disallow *.ai   *.eps  *.ppt  *.hqx
>         Disallow *.cpt  *.bms  *.oda  *.tcl
>         Disallow *.o    *.a    *.la   *.so
>         Disallow *.pat  *.pm   *.m4   *.am   *.css
>         Disallow *.map  *.aif  *.sit  *.sea
>         Disallow *.m3u  *.qt   *.mov
>
>         Disallow *D=A *D=D *M=A *M=D *N=A *N=D *S=A *S=D
>
>         Disallow Regex \.r[0-9][0-9]$ \.a[0-9][0-9]$ \.so\.[0-9]$
>         AddType        text/plain        *.txt  *.pl *.js *.h *.c *.pm *.e
>         AddType        text/html        *.html *.htm
>         AddType image/x-xpixmap        *.xpm
>         AddType image/x-xbitmap        *.xbm
>         AddType image/gif        *.gif
>
>         AddType        application/unknown *.*
>
>         Server http://arachnia.mc.duke.edu/madil001/
>
> But when I replace the Server parameter with either of the Realm
> parameters below,
>
>         Realm http://arachnia.mc.duke.edu/madil001/*
>         Realm Regex ^http://arachnia.mc.duke.edu/madil001/
>
> indexer fails to find any pages to index.
>
> -- James



