Hello Svein,

Quick answers to your questions:

- Nutch does not include an image crawler, though some people started working on one a long time ago, and Archive.org is sponsoring this work/project.
- Nutch has a distributed fetcher. Not sure about Heritrix.
- Nutch is being worked on, but not very aggressively at the moment. I think Heritrix development may be similar.

I know of another company that is using a modified version of Nutch for image crawling.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Svein Yngvar Willassen <[EMAIL PROTECTED]>
To: [email protected]
Sent: Saturday, April 5, 2008 3:35:26 PM
Subject: Nutch or Heritrix?

Hello folks,

We are in the starting phase of a project, and we are currently wondering whether Heritrix or Nutch is the best choice of crawler for us.

Our project: Basically, we're going to set up Hadoop and crawl the web for images. We will then run our own indexing software on the images stored in HDFS based on the Map/Reduce facility in Hadoop. We will not use other indexing than our own.

Some particular questions:

- Which crawler will handle crawling for images best?
- Which crawler will best adapt to a distributed crawling system, in which we use many servers conducting crawling together?
- Which crawler is/will be under most active development?

Any views on this?

Best Regards,
Svein Willassen
