We don't have to ignore the robots.txt file. The scheme could work like this:
Filter the known URLs in our db with the latest robots.txt information known to us and assign them to the crawlers. Each crawler returns files containing the page (both plain and inverted), all the links it extracted (some of which could be filtered and assigned to crawlers later), a fuzzy page signature to compare against the same page fetched by other crawlers, and so on.

At one of my previous employers we had a custom database with DNS mappings, mirrors, and robots exclusion information. The URLs were filtered through it before being assigned to the crawlers. This was a good approach, since the crawlers were not doing link extraction at the time; that was done as a separate step after a first-round crawl. I think the robots.txt information was updated at least once a week for a random site, and more often for popular sites. Yes, you may miss some transitions and crawl a newly excluded URL once or twice, but it's not a big deal. I don't remember getting many complaints about this from site owners. A rough sketch of the filtering step is at the end of this message.

Diego.

> I agree: but this means to violate the robots.txt
> standard and not to allow legitimate exclusion from a
> web server (which is not a fair behaviour IMHO)
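Here is a minimal sketch of the filtering step described above, assuming an in-memory cache of per-host exclusion rules with weekly refresh for ordinary hosts and daily refresh for popular ones. The class and method names (RobotsFilter, RobotRules, hostsNeedingRefresh) are purely illustrative, not existing Nutch APIs.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: filter candidate URLs against cached robots.txt
// rules before handing them to the crawlers, and refresh the rules weekly
// for ordinary hosts and more often for popular ones.
public class RobotsFilter {

    // Cached exclusion rules for one host.
    static class RobotRules {
        List<String> disallowedPrefixes = new ArrayList<String>();
        long fetchedAt;          // when robots.txt was last fetched (ms)
        boolean popular;         // popular hosts are refreshed more often

        boolean isStale(long now) {
            long maxAge = popular ? 24L * 3600L * 1000L          // ~1 day
                                  : 7L * 24L * 3600L * 1000L;    // ~1 week
            return now - fetchedAt > maxAge;
        }

        boolean allows(String path) {
            for (String prefix : disallowedPrefixes) {
                if (path.startsWith(prefix)) return false;
            }
            return true;
        }
    }

    private final Map<String, RobotRules> rulesByHost =
        new HashMap<String, RobotRules>();

    // Returns only the URLs the cached rules permit. The rules may be a few
    // days old, so a newly excluded URL can slip through once or twice before
    // the next refresh, as described in the message above.
    public List<String> filter(List<String> candidateUrls) {
        List<String> allowed = new ArrayList<String>();
        for (String url : candidateUrls) {
            RobotRules rules = rulesByHost.get(hostOf(url));
            if (rules == null || rules.allows(pathOf(url))) {
                allowed.add(url);
            }
        }
        return allowed;
    }

    // Hosts whose robots.txt should be re-fetched on the next pass.
    public List<String> hostsNeedingRefresh() {
        long now = System.currentTimeMillis();
        List<String> stale = new ArrayList<String>();
        for (Map.Entry<String, RobotRules> e : rulesByHost.entrySet()) {
            if (e.getValue().isStale(now)) stale.add(e.getKey());
        }
        return stale;
    }

    private static String hostOf(String url) {
        try {
            String h = new URI(url).getHost();
            return h == null ? "" : h;
        } catch (URISyntaxException e) {
            return "";
        }
    }

    private static String pathOf(String url) {
        try {
            String p = new URI(url).getPath();
            return (p == null || p.isEmpty()) ? "/" : p;
        } catch (URISyntaxException e) {
            return "/";
        }
    }
}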