Hello list members,

I am looking for a solution to crawl about 100 million internet pages with a (focused) crawler. The crawler should be able to filter URLs by regular expressions and to enforce a depth limit per domain (no real need for sophisticated "topic" intelligence). The goal is to build up an index in a database (e.g. MySQL).
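To make the requirements concrete, the kind of filtering I have in mind could be sketched roughly like this (the pattern and the depth limit below are just placeholder assumptions, not my actual configuration):

```python
import re
from urllib.parse import urlparse

# Assumed example pattern: only follow http(s) URLs ending in .htm/.html.
URL_PATTERN = re.compile(r"^https?://[^/]+/.*\.html?$")
MAX_DEPTH = 3  # assumed per-domain depth limit for illustration

def should_crawl(url: str, depth: int) -> bool:
    """Decide whether a discovered link should be fetched.

    `depth` is how many links deep the URL is within its domain;
    the URL must also match the configured regular expression.
    """
    if depth > MAX_DEPTH:
        return False
    return URL_PATTERN.match(url) is not None
```

So per URL the crawler only needs a regex match and an integer depth check; nothing beyond that is required.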
- Which crawler would be the fastest solution on a single Debian machine (AMD Opteron 1212 HE, Debian Etch, 2 GB RAM)? I have read about the following crawlers; which of them are the fastest for my purpose, or are there better ones?
  - iVia Data Fountains
  - Nutch
  - Combine
  - DataparkSearch
  - Terrier
  - Sherlock Holmes
- Would having only 1 GB of RAM noticeably affect speed?

Thank you very much!

Georg
