I may be completely wrong about this, but below is my plan for super-fast
crawling. I prepared it for a venture that no longer needs it, but it looks
like fun to do anyway. What would you all say: is there a need, and what's
wrong with the plan?

Thank you,
Mark

Fast Crawl Plan
===============

The goal of Nutch is exhaustive crawling. It works best for internal sites
or intranets, and it has known problems with wide web crawls. It is
optimized for correctness, and since it is an open-source engine meant to be
safe for anyone to run, its polite crawling is hard to misconfigure - but it
is not optimized for raw throughput.

I also see another area that slows it down: it keeps its crawl state in a
database. That makes it easy to program, scale, and operate, but it does not
make it fast. Few truly fast applications keep a database on the critical path.

Therefore, I would write my own crawler, optimized for performance. Here is
what my approach would be:

   - I would look at the Nutch code for snippets, for example Fetcher.java,
   so as not to reinvent the wheel;
   - Having made the individual in-thread performance reasonably fast, I
   would do the following optimization steps:
   - Use a fast mechanism for real-time thread coordination - not a
   database, but JavaSpaces (the free GigaSpaces implementation); a rough
   sketch follows this list;
   - Prepare URLs for simultaneous fetching from different domains in
   different threads, with more-or-less polite crawling within each domain
   (see the per-host queue sketch after this list);
   - Build in blocking detection. Today we don't even know when, or if, we
   are being blocked - and being blocked can show up as time-outs (see the
   detector sketch after this list);
   - Do it on one crawler for starters, but keep in mind that the code
   should later scale out to a Hadoop cluster.
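
To make the JavaSpaces idea concrete, here is roughly what I have in mind.
This is only a sketch under my own assumptions: the UrlTask entry and the way
the JavaSpace handle is obtained (GigaSpaces ships its own lookup helpers)
are mine, not anything taken from Nutch or GigaSpaces.

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    public class SpaceQueue {

        // JavaSpaces entries need a public no-arg constructor and public fields.
        public static class UrlTask implements Entry {
            public String url;
            public String host;
            public UrlTask() {}
            public UrlTask(String url, String host) { this.url = url; this.host = host; }
        }

        private final JavaSpace space;

        public SpaceQueue(JavaSpace space) { this.space = space; } // handle obtained elsewhere

        // Producer side: the URL generator writes tasks into the space.
        public void put(String url, String host) throws Exception {
            space.write(new UrlTask(url, host), null, Lease.FOREVER);
        }

        // Consumer side: a fetcher thread takes any matching task, blocking up to 10 s.
        public UrlTask next() throws Exception {
            return (UrlTask) space.take(new UrlTask(), null, 10000);
        }
    }

The point is that write/take on the space replaces inserts and selects on a
database, with no disk round-trip on the critical path.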
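For the per-domain fetching, the sketch below (plain JDK; names like
PoliteFetcher and the one-second CRAWL_DELAY_MS are just my placeholders)
keeps one queue and one drainer thread per host, so different hosts are
fetched in parallel while each host sees at most one request per delay
interval:

    import java.net.URI;
    import java.util.Map;
    import java.util.concurrent.*;

    public class PoliteFetcher {
        private static final long CRAWL_DELAY_MS = 1000;   // assumed per-host politeness gap
        private final Map<String, BlockingQueue<String>> perHost = new ConcurrentHashMap<>();
        private final ExecutorService pool = Executors.newCachedThreadPool();

        public void submit(String url) {
            String host = URI.create(url).getHost();
            if (host == null) return;                       // skip malformed URLs
            perHost.computeIfAbsent(host, h -> {
                BlockingQueue<String> q = new LinkedBlockingQueue<>();
                pool.submit(() -> drain(q));                // one long-running drainer per host
                return q;
            }).add(url);
        }

        // Fetch URLs for a single host, one at a time, with a delay between requests.
        // A real crawler would also retire idle hosts instead of keeping a thread forever.
        private void drain(BlockingQueue<String> q) {
            try {
                while (true) {
                    String url = q.take();
                    fetch(url);                             // placeholder for the real HTTP fetch + parse
                    Thread.sleep(CRAWL_DELAY_MS);           // stay polite within the host
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private void fetch(String url) { /* HTTP GET would go here */ }
    }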
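And for blocking detection, something as simple as counting consecutive
suspicious responses per host would already tell us more than we know today.
Again just a sketch; the status codes and the threshold of 5 are my
assumptions, not measured values:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class BlockDetector {
        private static final int THRESHOLD = 5;   // assumed: 5 bad responses in a row => likely blocked
        private final Map<String, AtomicInteger> badStreak = new ConcurrentHashMap<>();

        // Call after every fetch; returns true once the host starts to look blocked.
        public boolean record(String host, int httpStatus, boolean timedOut) {
            boolean suspicious = timedOut || httpStatus == 403 || httpStatus == 429 || httpStatus == 503;
            AtomicInteger streak = badStreak.computeIfAbsent(host, h -> new AtomicInteger());
            if (!suspicious) {
                streak.set(0);
                return false;
            }
            return streak.incrementAndGet() >= THRESHOLD;
        }
    }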

Mark

On Tue, Nov 24, 2009 at 11:32 AM, MilleBii <mille...@gmail.com> wrote:

> Why would DNS local caching work... It only works if you are
> going to crawl the same site often... in which case you are hit by
> the politeness delay.
>
> If you have segments with only/mainly different sites, it is not really
> going to help.
>
> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> Hadoop go faster than 10 fetches/s... Let me check the DNS and I
> will tell you.
>
> I vote for 100 fetches/s, not sure how to get it though.
>
>
>
> 2009/11/24, Dennis Kubes <ku...@apache.org>:
> > Hi Mark,
> >
> > I just put this up on the wiki.  Hope it helps:
> >
> > http://wiki.apache.org/nutch/OptimizingCrawls
> >
> > Dennis
> >
> >
> > Mark Kerzner wrote:
> >> Hi, guys,
> >>
> >> my goal is to do my crawls at 100 fetches per second, observing, of
> >> course, polite crawling. But when the URLs are all on different domains,
> >> what theoretically would stop some software from downloading from 100
> >> domains at once, achieving the desired speed?
> >>
> >> But, whatever I do, I can't make Nutch crawl at that speed. Even if it
> >> starts at a few dozen URLs/second, it slows down at the end (as discussed
> >> by many and by Krugler).
> >>
> >> Should I write something of my own, or are there fast crawlers?
> >>
> >> Thanks!
> >>
> >> Mark
> >>
> >
>
> --
> Sent from my mobile
>
> -MilleBii-
>
