Hello folks,

We are in the early phase of a project and are wondering whether Heritrix or Nutch is the better choice of crawler for us.

Our project: Basically, we're going to set up Hadoop and crawl the web for images. We will then run our own indexing software on the images stored in HDFS, using the Map/Reduce facility in Hadoop. We will not use any indexing other than our own.

Some particular questions:

- Which crawler will handle crawling for images best?
- Which crawler will best adapt to a distributed crawling system, in which many servers crawl together?
- Which crawler is, or will be, under the most active development?

Any views on this?

Best Regards,
Svein Willassen
