Author: Tim Hewitt
Email: [EMAIL PROTECTED]
Message:
I have run indexer -Cw and deleted all the old data between runs, so I think that any 
existing links in the database are not being followed. I did not drop the database and 
create a new one, but there were no records reported from an indexer -S report, so I 
assume that my next indexer run would start from scratch. It certainly appears to do 
so in the log file.

I am only running one indexer at a time. Currently I am only indexing my website 
internally, using MySQL as the database. The robots table in the database correctly 
contains the entries from my robots.txt file.

The version I am running is 3.1.12 on Linux.

The reason I think the program is not following the robots.txt standard is based on 
discussions with other robot writers. I'll see if I can get permission to share their 
actual emails with you. One of them is involved in a major search engine web crawler.

The standard is a bit ambiguous in this section, but a Disallow entry such as:

 /forums/myfile.php

should prevent the following URLs from being indexed:

 /forums/myfile.php?s=123465
 /forums/myfile.php3
 /forums/myfile.php?anythinghere

A URL on the site should be treated as disallowed if its beginning completely matches 
the Disallowed URL. That is, the Disallowed URL should be treated as a leading 
substring (prefix) of the URLs to be disallowed.
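
To make the rule concrete, here is a minimal Python sketch of the leading-substring 
match described above. It is only an illustration of the behaviour I expect, not the 
indexer's actual code; the function name is_disallowed and the sample paths other than 
the ones listed above are made up for the example.

```python
def is_disallowed(url_path, disallow_rules):
    """Return True if url_path begins with any non-empty Disallow value.

    Hypothetical helper for illustration: a rule is treated as a plain
    leading substring (prefix) of the URL path, query string included.
    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(rule and url_path.startswith(rule) for rule in disallow_rules)


if __name__ == "__main__":
    rules = ["/forums/myfile.php"]
    candidates = [
        "/forums/myfile.php?s=123465",
        "/forums/myfile.php3",
        "/forums/myfile.php?anythinghere",
        "/forums/otherfile.php",  # hypothetical path that should stay indexable
    ]
    for path in candidates:
        status = "disallowed" if is_disallowed(path, rules) else "allowed"
        print(path, status)
```

With the single rule above, the three /forums/myfile.php variants are reported as 
disallowed, while a URL that merely shares the /forums/ directory is still allowed.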

I love the search program, by the way. Even with this little problem, the utility and 
ease of use of this program are really wonderful.

I'm probably going to create a second config file that only indexes the specific 
/forums/xxx.php URLs that I want added to my index. It looks like this is doable - at 
least as a workaround.

Best regards,

-Tim

Reply: <http://search.mnogo.ru/board/message.php?id=2009>
