Author: Tim Hewitt
Email: [EMAIL PROTECTED]
Message:
I have run indexer -Cw and deleted all the old data between runs, so I do not think
that any existing links in the database are being followed. I did not drop the database
and create a new one, but an indexer -S report showed no records, so I assume that my
next indexer run would start from scratch. It certainly appears to do so in the log
file.
I am only running one indexer at a time. I am currently indexing only my own website,
internally, using MySQL as the database. The robots table in the database correctly
contains the entries from my robots.txt file.
The version I am running is 3.1.12 on Linux.
My belief that the program is not following the robots.txt standard is based on
discussions with other robot writers. I'll see if I can get permission to share their
actual emails with you. One of them is involved in a major search engine web crawler.
The standard is a bit ambiguous in this section, but a Disallow entry such as:
/forums/myfile.php
should cause the following files to not be indexed:
/forums/myfile.php?s=123465
/forums/myfile.php3
/forums/myfile.php?anythinghere
A Disallow entry should match whenever the beginning of a URL on the site is identical
to the Disallowed URL. That is, the Disallowed URL should be treated as a leading
substring (a prefix) of the URLs to be disallowed.
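To illustrate the matching I have in mind, here is a minimal Python sketch. It is not
mnoGoSearch code: the function name is_disallowed and the rule list are made up for
this example, and it only handles plain Disallow values compared against the URL path
plus query string.

```python
from typing import Iterable


def is_disallowed(url_path: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if url_path begins with any non-empty Disallow value.

    Hypothetical helper for illustration only. An empty Disallow value
    is skipped, since in robots.txt it means "nothing is disallowed".
    """
    return any(rule and url_path.startswith(rule) for rule in disallow_rules)


# Assumed rule, taken from the example entry above.
rules = ["/forums/myfile.php"]

for path in (
    "/forums/myfile.php?s=123465",
    "/forums/myfile.php3",
    "/forums/myfile.php?anythinghere",
    "/forums/other.php",
):
    print(path, "->", "disallowed" if is_disallowed(path, rules) else "allowed")
```

With this rule, the first three paths are reported as disallowed and /forums/other.php
as allowed, which is the behaviour I would expect from the indexer.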
I love the search program, by the way. Even with this little problem, the utility and
ease of use of this program are really wonderful.
I'm probably going to create a second config file that only indexes the specific
/forums/xxx.php URLs that I want added to my index. It looks like this is doable - at
least as a workaround.
Best regards,
-Tim
Reply: <http://search.mnogo.ru/board/message.php?id=2009>