Hello list members,

I am looking for a solution to crawl about 100 million internet pages with a (focused) crawler. The crawler should be able to filter URLs by regular expressions and to enforce a depth limit per domain (no real need for sophisticated "topic" intelligence). The goal is to build up an index in a database (e.g. MySQL).
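To make the requirements concrete, the kind of filtering I have in mind could be sketched roughly like this (the pattern and the depth limit below are just placeholder assumptions, not my actual configuration):

```python
import re
from urllib.parse import urlparse

# Assumed example pattern: only follow http(s) URLs ending in .htm/.html.
URL_PATTERN = re.compile(r"^https?://[^/]+/.*\.html?$")
MAX_DEPTH = 3  # assumed per-domain depth limit for illustration

def should_crawl(url: str, depth: int) -> bool:
    """Decide whether a discovered link should be fetched.

    `depth` is how many links deep the URL is within its domain;
    the URL must also match the configured regular expression.
    """
    if depth > MAX_DEPTH:
        return False
    return URL_PATTERN.match(url) is not None
```

So per URL the crawler only needs a regex match and an integer depth check; nothing beyond that is required.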
- Which crawler would be the fastest solution on a single Debian machine (AMD Opteron 1212 HE, Debian Etch, 2 GB RAM)? I have read about the following crawlers; which of them are the fastest for my purpose, or are there better ones?
  - iVia Data Fountains
  - Nutch
  - Combine
  - DataparkSearch
  - Terrier
  - Sherlock Holmes
- Would having only 1 GB of RAM noticeably affect speed?

Thank you very much!

Georg
