Diego Basch wrote:

> In my opinion, the only significant improvement would be the ability to reduce cloaking. Cloaked servers present a different page to a crawler based on two things: the user agent and the IP address range. If crawlers used random IP addresses and Mozilla as a user agent, cloakers would have a harder time telling them apart from regular users.

This means violating the robots.txt recommendation. You can do this from a single location. What is the difference?

> Cloaking is one of the main reasons Google's relevance has decreased over time; this is why I believe a distributed crawling approach has some merit. Of course, it would be pointless if spammers could tamper with the crawling process.
>
> Diego.
>
> --- Antonio Gulli <[EMAIL PROTECTED]> wrote:
>
>> Diego Basch wrote:
>>
>>> The main problem with this approach is: how do you stop malicious users from reporting bogus changes?
>>
>> My issue is whether such an approach (distributed spidering) is needed at all. Spidering costs are peanuts. Indexing and, above all, serving the queries are the main costs.

--
"With a heavy dose of fear and violence, and a lot of money for projects, I think we can convince people that we are here to help them."
LT. COL. NATHAN SASSAMAN, NYtimes.com, 7th Dec 2003
http://www.di.unipi.it/~gulli/

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
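
The cloaking mechanism discussed in this message (serving a different page depending on the user agent and the IP address range) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the agent tokens, the IP range (a reserved documentation range), the toy `cloaked_server`, and the helper names are hypothetical and are not part of Nutch or of anything proposed in the thread.

```python
import hashlib
import ipaddress

# Hypothetical values for illustration only; none of them come from the thread.
CRAWLER_AGENT_TOKENS = ("googlebot", "nutch")
CRAWLER_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)  # reserved documentation range


def looks_like_crawler(user_agent: str, client_ip: str) -> bool:
    """Return True if either signal (user agent or IP range) points to a crawler."""
    agent = user_agent.lower()
    if any(token in agent for token in CRAWLER_AGENT_TOKENS):
        return True
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in CRAWLER_NETWORKS)


def cloaked_server(user_agent: str, client_ip: str) -> str:
    """Toy cloaking server: one page for suspected crawlers, another for everyone else."""
    if looks_like_crawler(user_agent, client_ip):
        return "<html>cheap flights cheap hotels cheap cars</html>"
    return "<html>Unrelated page shown to human visitors</html>"


def fingerprint(page: str) -> str:
    """Hash the page body so two responses can be compared cheaply."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()


def is_cloaking(fetch, crawler_identity, browser_identity) -> bool:
    """Fetch the same page under two identities and report whether the bodies differ."""
    crawler_page = fetch(*crawler_identity)
    browser_page = fetch(*browser_identity)
    return fingerprint(crawler_page) != fingerprint(browser_page)


if __name__ == "__main__":
    crawler = ("Nutch/0.3", "192.0.2.10")        # announces itself, known range
    browser = ("Mozilla/5.0", "198.51.100.7")    # browser-like agent, unrelated address
    print(is_cloaking(cloaked_server, crawler, browser))
```

A request with a Mozilla user agent from an address outside the known range is indistinguishable from a regular visitor in this sketch, which is the point Diego makes. The exact-hash comparison is deliberately naive: real pages often change between requests for legitimate reasons, so a practical check would need a more tolerant similarity measure.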
- [Nutch-dev] fast database update Brice Tsakam
- Re: [Nutch-dev] fast database update Diego Basch
- Re: [Nutch-dev] fast database update Antonio Gulli
- Re: [Nutch-dev] fast database update Diego Basch
- Re: [Nutch-dev] fast database update Antonio Gulli
- Re: [Nutch-dev] fast database update Diego Basch
- Re: [Nutch-dev] fast database upda... Antonio Gulli
- Re: [Nutch-dev] fast database ... Diego Basch
- Re: [Nutch-dev] fast database update Pierre
- Re: [Nutch-dev] fast database update otisg