The difference is obvious if you look at the situation from the cloaker's point of view. Suppose you were trying to present different information to a search engine than to a person with a browser. There are a few modifications you could apply to a web server; the most obvious is to serve a different page based on the user agent string.
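A minimal sketch of that user-agent switch; the crawler substrings and page bodies are made up purely for illustration:

```python
# Hypothetical sketch of user-agent cloaking: serve one page to known
# crawler user agents and a different page to everyone else.

CRAWLER_SUBSTRINGS = ("googlebot", "slurp")  # illustrative examples only

def select_page(user_agent: str) -> str:
    """Return keyword-stuffed content for crawlers, normal content otherwise."""
    ua = user_agent.lower()
    if any(s in ua for s in CRAWLER_SUBSTRINGS):
        return "<html>keyword-stuffed page for the crawler</html>"
    return "<html>regular page for human visitors</html>"

# A crawler UA gets one page, a browser UA gets another:
print(select_page("Googlebot/2.1 (+http://www.google.com/bot.html)"))
print(select_page("Mozilla/5.0 (Windows NT 10.0)"))
```

This is exactly why serving by user agent alone is fragile: a crawler that simply sends a browser's user agent string falls through to the regular page.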
Let's assume that the crawlers use the user agent string of a regular browser. The next step for the cloaker is to identify the IP address ranges used by known crawlers. Right now those ranges are well known: just do a reverse lookup and check for *.googlebot.com or *.inktomisearch.com, for example. If the IP addresses were unpredictable (random machines on regular ISP networks), the cloaker would have to resort to heuristics. Here are some things a smart cloaker could do:

- identify machines that download a sufficiently high number of pages, or a small number of pages very quickly;
- plant "invisible" links that are not meant to be followed by humans, and take note of the machines that fetch those pages.

Both problems could be solved if the number of crawlers were sufficiently high. Assume you had a million crawlers. Each one could be assigned three thousand URLs, all from different hosts, to crawl over 24 hours. Each URL would have to be retrieved by more than one crawler in order to detect cheating: a malicious spammer could volunteer hundreds of machines in the hope of "catching" some of his own URLs among his randomly assigned sets and feeding the crawlers arbitrary content. That is where the voting system comes into play:

- five random computers fetch URL A, which belongs to a spammer;
- one or (with a lot of bad luck) two of them belong to the spammer, who feeds them arbitrary content;
- the other machines report a page digest that is substantially different from the one reported by the spammer's machines.

At that point there are two options: either the page is discarded, or the content with the most votes is accepted.

To summarize: the only way to defeat cloaking is to successfully disguise crawlers as random web browsers.

Diego.

--- Antonio Gulli <[EMAIL PROTECTED]> wrote:
> Diego Basch wrote:
> > In my opinion, the only significant improvement would be
> > the ability to reduce cloaking. Cloaked servers present a
> > different page to a crawler based on two things: the user
> > agent and the IP address range.
> > If crawlers used random IP addresses and Mozilla as a
> > user agent, cloakers would have a harder time telling
> > them apart from regular users.
>
> This means violating the robots.txt recommendation.
> You can do this from a single location.
> What is the difference?
>
> > Cloaking is one of the main reasons Google's relevance
> > has decreased over time, which is why I believe a
> > distributed crawling approach has some merit. Of course,
> > it would be pointless if spammers could tamper with the
> > crawling process.
> >
> > Diego.
> >
> > --- Antonio Gulli <[EMAIL PROTECTED]> wrote:
> > > Diego Basch wrote:
> > > > The main problem with this approach is: how do you
> > > > stop malicious users from reporting bogus changes?
> > >
> > > My issues are about the need for such an approach
> > > (distributed spidering). Spidering costs are peanuts.
> > > Indexing and, above all, serving the queries are the
> > > main costs.
> > >
> > > --
> > > "With a heavy dose of fear and violence, and a lot of money
> > > for projects, I think we can convince people that we are here
> > > to help them." LT. COL. NATHAN SASSAMAN NYtimes.com 7th, Dec 2003
> > > http://www.di.unipi.it/~gulli/

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
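The reverse-lookup check discussed earlier could be sketched as follows. The two domain suffixes come from the text; the function names and the use of Python's `socket.gethostbyaddr` are illustrative choices, and a careful cloaker would also confirm the result with a forward lookup:

```python
import socket

# Domain suffixes of known crawlers, per the examples in the message.
CRAWLER_DOMAINS = (".googlebot.com", ".inktomisearch.com")

def is_crawler_hostname(hostname: str) -> bool:
    """True if a reverse-DNS name ends in a known crawler domain."""
    return hostname.lower().endswith(CRAWLER_DOMAINS)

def looks_like_known_crawler(ip: str) -> bool:
    """Reverse-resolve a client IP and check it against known crawler domains."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse record: treat as a regular visitor
    return is_crawler_hostname(hostname)
```

If crawler requests instead came from random ISP addresses, this check returns nothing useful, which is what forces the cloaker into the behavioral heuristics described above.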
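The digest-voting scheme described earlier might look like this in outline; the SHA-1 digest and the three-vote acceptance threshold are assumptions for illustration, not anything specified in the message:

```python
import hashlib
from collections import Counter

def page_digest(content: bytes) -> str:
    # A plain content hash; a real system would normalize the page first
    # so trivial differences (timestamps, ads) don't change the digest.
    return hashlib.sha1(content).hexdigest()

def vote_on_content(fetched, min_votes=3):
    """Accept the content whose digest most crawlers agree on,
    or None (discard the page) if no digest reaches min_votes."""
    counts = Counter(page_digest(c) for c in fetched)
    digest, votes = counts.most_common(1)[0]
    if votes < min_votes:
        return None
    for c in fetched:
        if page_digest(c) == digest:
            return c

# Five crawlers fetch URL A; two are controlled by the spammer.
honest = b"<html>real page</html>"
spam = b"<html>spam page</html>"
print(vote_on_content([honest, spam, honest, spam, honest]))  # the honest page wins
```

With only two colluding machines out of five, the spammer's digest never reaches the threshold, so either the honest majority content is accepted or, if the vote is too fragmented, the page is dropped.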