Nutch or Heritrix?

Svein Yngvar Willassen Sat, 05 Apr 2008 06:35:58 -0700

Hello folks,

We are in the starting phase of a project, and we are currently wondering
whether Heritrix or Nutch is the better choice of crawler for us.


Our project:

Basically, we're going to set up Hadoop and crawl the web for images.
We will then run our own indexing software on the images stored in HDFS,
using the Map/Reduce facility in Hadoop. We will not use any indexing other
than our own.

Some particular questions:

- Which crawler will handle crawling for images best?
- Which crawler will best adapt to a distributed crawling system, in which
  many servers crawl together?
- Which crawler is, or will be, under the most active development?


Any views on this?


Best Regards,

Svein Willassen
