We don't have to ignore the robots.txt file. The
scheme could work like this:

Filter the known URLs in our DB against the latest
robots.txt information known to us and assign them to
the crawlers. Each crawler will return files containing
the page (both plain and inverted), all the links
extracted (some could be filtered and assigned to
crawlers later), a fuzzy page signature for comparing
it against the same page fetched by other crawlers, etc.
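
Something along these lines could do the filtering and
assignment step. This is just a minimal sketch; the class
names (CrawlDispatcher, RobotRules, FetchResult) and the
round-robin assignment are made up for illustration, they
are not Nutch APIs:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CrawlDispatcher {

    /** Hypothetical robots.txt snapshot for one host: disallowed path prefixes. */
    static class RobotRules {
        final List<String> disallowedPrefixes;
        RobotRules(List<String> disallowedPrefixes) {
            this.disallowedPrefixes = disallowedPrefixes;
        }
        boolean allows(String path) {
            for (String prefix : disallowedPrefixes) {
                if (path.startsWith(prefix)) return false;
            }
            return true;
        }
    }

    /** What each crawler hands back per fetched page, as described above. */
    static class FetchResult {
        String url;
        String plainText;       // page content; the inverted form is built downstream
        List<String> outlinks;  // links extracted, to be filtered and assigned later
        long fuzzySignature;    // e.g. a shingled hash, to compare fetches of the same page
    }

    /**
     * Filter the known URLs with the latest robots information we hold,
     * then assign the survivors round-robin to the crawlers.
     */
    static List<List<String>> assign(List<String> knownUrls,
                                     Map<String, RobotRules> rulesByHost,
                                     int numCrawlers) {
        List<List<String>> perCrawler = new ArrayList<>();
        for (int i = 0; i < numCrawlers; i++) perCrawler.add(new ArrayList<>());
        int next = 0;
        for (String url : knownUrls) {
            java.net.URI uri = java.net.URI.create(url);
            RobotRules rules = rulesByHost.get(uri.getHost());
            // Unknown host: be permissive here, or queue a robots.txt fetch instead.
            if (rules != null && !rules.allows(uri.getPath())) continue;
            perCrawler.get(next % numCrawlers).add(url);
            next++;
        }
        return perCrawler;
    }
}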

At one of my previous employers we used to have a
custom database with DNS mappings, mirrors and robots
exclusion information. The URLs were filtered through
it before being assigned to the crawlers. This was a
good approach since the crawlers were not doing link
extraction at the time; that was done as a separate
step after a first-round crawl. I think the robots.txt
information was updated at least once a week for an
ordinary site and more often for popular sites. Yes, you
may miss some transitions and crawl newly excluded
URLs once or twice, but it's not a big deal. I don't
remember getting too many complaints about this from
site owners.
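
A rough sketch of that refresh schedule, in the same spirit;
the popularity threshold and the intervals below are guesses
for illustration, not the values we actually used:

import java.time.Duration;
import java.time.Instant;

public class RobotsRefreshPolicy {

    static final Duration DEFAULT_INTERVAL = Duration.ofDays(7);  // "at least once a week"
    static final Duration POPULAR_INTERVAL = Duration.ofDays(1);  // assumed value for popular hosts

    /** A host is "popular" here if it holds many known URLs; purely illustrative. */
    static boolean isPopular(long knownUrlCount) {
        return knownUrlCount > 10_000;
    }

    /** True if this host's robots.txt is due for another fetch. */
    static boolean needsRefresh(Instant lastFetched, long knownUrlCount, Instant now) {
        Duration interval = isPopular(knownUrlCount) ? POPULAR_INTERVAL : DEFAULT_INTERVAL;
        return lastFetched == null || lastFetched.plus(interval).isBefore(now);
    }
}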

Diego.

> I agree: but this means violating the robots.txt
> standard and not allowing legitimate exclusion by a
> web server (which is not fair behaviour IMHO)
> 

