The difference is obvious if you look at the situation
from the cloaker's point of view. Suppose you were
trying to present different information to a search
engine and to a person with a browser. There are a few
ways you could set up a web server to do that. The
most obvious is to serve a different page based on the
user agent string.
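
In servlet terms that branch is tiny. A hypothetical
sketch (not taken from any real cloaking setup; the
class name and page paths are invented):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

// Hypothetical cloaking servlet: pick the page variant by user agent.
public class CloakByUserAgent extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        String ua = req.getHeader("User-Agent");
        boolean crawler = ua != null && ua.toLowerCase().contains("googlebot");
        // Crawlers get the keyword-stuffed variant, humans the real page.
        req.getRequestDispatcher(crawler ? "/for-crawlers.html" : "/for-humans.html")
           .forward(req, res);
    }
}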

Let's assume our crawlers use the user agent string of
a regular browser. The next step for the cloaker is to
identify the IP address ranges used by known crawlers.
Right now those are well known: just do a reverse DNS
lookup and check for *.googlebot.com or
*.inktomisearch.com, for example.
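
To make that concrete, here is a minimal sketch (Java,
class name invented) of the reverse-lookup test a
cloaker might run on each request; the suffixes are
just the two examples above. A careful cloaker would
also verify the forward lookup to rule out spoofed PTR
records.

import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper: decide whether a request comes from a
// known crawler, based only on a reverse DNS lookup.
public class CrawlerCheck {

    private static final String[] CRAWLER_SUFFIXES = {
        ".googlebot.com", ".inktomisearch.com"
    };

    public static boolean looksLikeKnownCrawler(String ip) {
        try {
            // Reverse lookup: map the request IP back to a host name.
            String host = InetAddress.getByName(ip).getCanonicalHostName();
            for (String suffix : CRAWLER_SUFFIXES) {
                if (host.endsWith(suffix)) {
                    return true;   // serve the crawler version of the page
                }
            }
        } catch (UnknownHostException e) {
            // No reverse record: treat it as a regular browser.
        }
        return false;
    }
}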

If the IP addresses were unpredictable (random
machines on regular ISP networks), the cloaker would
have to resort to heuristics. Here are some things a
smart cloaker could do (a sketch of the first one
follows the list):

- identify machines that download a sufficiently high
number of pages, or a small number of pages very
quickly.
- plant "invisible" links that are not meant to be
followed by humans, and take note of the machines that
fetch those pages.
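
A rough sketch of the first heuristic, with invented
class name and thresholds:

import java.util.HashMap;
import java.util.Map;

// Hypothetical request-rate heuristic: flag an IP as a suspected
// crawler if it fetches too many pages, or a few pages too quickly.
public class RateHeuristic {

    private static final int MAX_PAGES = 500;          // pages before flagging
    private static final long MIN_INTERVAL_MS = 2000;  // humans rarely click faster
    private static final int FAST_HITS_LIMIT = 10;     // tolerated fast requests

    private final Map<String, int[]> stats = new HashMap<>();   // ip -> {pages, fastHits}
    private final Map<String, Long> lastSeen = new HashMap<>();

    public boolean looksLikeCrawler(String ip, long nowMillis) {
        int[] s = stats.computeIfAbsent(ip, k -> new int[2]);
        Long prev = lastSeen.put(ip, nowMillis);
        s[0]++;                                          // pages downloaded so far
        if (prev != null && nowMillis - prev < MIN_INTERVAL_MS) {
            s[1]++;                                      // suspiciously fast request
        }
        return s[0] > MAX_PAGES || s[1] > FAST_HITS_LIMIT;
    }
}

The invisible-link trap is even simpler: log every
client that requests a URL no human-visible page links
to.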

Both of those heuristics could be defeated if the
number of crawlers were sufficiently high. Assume you
had a million crawlers. Each one could be assigned
three thousand URLs, all from different hosts, to
crawl over 24 hours. That works out to roughly one
fetch every thirty seconds per machine, and since the
URLs come from different hosts, no single site ever
sees more than a page or two from any one IP.
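
A minimal sketch of that assignment step, assuming a
central coordinator deals out the work (the class name
and grouping strategy are my own illustration, not an
existing Nutch API):

import java.net.URI;
import java.util.*;

// Hypothetical coordinator: spread URLs over volunteer crawlers so
// that each crawler receives URLs from many different hosts.
public class UrlAssigner {

    public static Map<Integer, List<String>> assign(List<String> urls, int crawlers) {
        // Group URLs by host (assumes well-formed absolute URLs).
        Map<String, List<String>> byHost = new HashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
        }
        // Deal the hosts out round-robin, so consecutive URLs from the
        // same host land on different crawlers.
        Map<Integer, List<String>> plan = new HashMap<>();
        int next = 0;
        for (List<String> hostUrls : byHost.values()) {
            for (String url : hostUrls) {
                plan.computeIfAbsent(next % crawlers, c -> new ArrayList<>()).add(url);
                next++;
            }
        }
        return plan;   // crawler id -> its ~3000 URLs for the next 24 hours
    }
}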

Each URL would have to be retrieved by more than one
crawler in order to detect cheating: a malicious
spammer could volunteer hundreds of machines in the
hope of "catching" some of his own URLs among the sets
randomly assigned to those machines, and then feed the
crawler arbitrary content. That is where a voting
system comes into play:

- five random computers fetch URL A, which belongs to
a spammer
- one or (with a lot of bad luck) two of them belong
to the spammer, who feeds them arbitrary content
- the other three machines report a page digest that
is substantially different from the one reported by
the spammer's machines

At that point there are two options: either the page
is discarded, or the content reported by the majority
of crawlers is accepted.
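
A minimal sketch of that vote, assuming each crawler
reports a digest (say, MD5) of the page it fetched;
the class and method names are invented:

import java.util.*;

// Hypothetical majority vote over the page digests reported by the
// crawlers that fetched the same URL.
public class DigestVote {

    // Returns the digest reported by a strict majority of crawlers,
    // or null if no digest wins (in which case the page is discarded).
    public static String winningDigest(List<String> reportedDigests) {
        Map<String, Integer> counts = new HashMap<>();
        for (String digest : reportedDigests) {
            counts.merge(digest, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() * 2 > reportedDigests.size()) {
                return e.getKey();   // e.g. 3 of the 5 fetches agree
            }
        }
        return null;                 // no majority: drop the page
    }
}

In practice the comparison would have to tolerate small
legitimate differences between fetches (ads, timestamps,
session ids), which is why "substantially different"
rather than byte-for-byte identical is the right test.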

To summarize: the only way to defeat cloaking is to
successfully disguise crawlers as random web browsers.

Diego.

--- Antonio Gulli <[EMAIL PROTECTED]> wrote:
> Diego Basch wrote:
>
> > In my opinion, the only significant improvement
> > would be the ability to reduce cloaking. Cloaked
> > servers present a different page to a crawler based
> > on two things: the user agent and the IP address
> > range. If crawlers used random IP addresses and
> > Mozilla as a user agent, cloakers would have a
> > harder time telling them apart from regular users.
>
> This means violating the robots.txt recommendation.
> You can do this from a single location.
> What is the difference?
> > Cloaking is one of the main reasons Google's
> > relevance has decreased over time, which is why I
> > believe a distributed crawling approach has some
> > merit. Of course, it would be pointless if spammers
> > could tamper with the crawling process.
> >
> > Diego.
> >
> > --- Antonio Gulli <[EMAIL PROTECTED]> wrote:
> >
> >> Diego Basch wrote:
> >>
> >>> The main problem with this approach is:
> >>> how do you stop malicious users from reporting
> >>> bogus changes?
> >>
> >> My issue is with the need for such an approach
> >> (distributed spidering) at all: spidering costs
> >> are peanuts. Indexing and, above all, serving
> >> the queries are the main cost.

