Hi, we are running Nutch 0.9 on a 9-machine cluster. We are doing a general
web crawl for text documents (HTML, PDF, DOC, TXT, etc.). We are getting
about 580k documents fully indexed every 24 hours. Is this an expected level
of throughput, or should it be higher? It seems low to me.

Four of the machines are dual-processor with 2 GHz Intel Xeons, 4 GB RAM, and
a 226 GB hard disk; the remaining five are single-processor with a 2.4 GHz
Intel Celeron, 2 GB RAM, and a 226 GB hard disk.
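
To put the numbers above in per-second terms, here is a rough back-of-the-envelope sketch. It assumes the load is spread evenly across machines and processors, which is almost certainly not true in practice since the hardware is mixed; the function and variable names are only illustrative.

```python
# Rough throughput breakdown from the figures in this post.
# Assumption: work is spread evenly, which a mixed cluster will not match exactly.

SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_second(docs_per_day, workers=1):
    """Average documents per second handled by each of `workers` units."""
    return docs_per_day / (SECONDS_PER_DAY * workers)


docs_per_day = 580_000        # documents fully indexed every 24 hours
machines = 9                  # total machines in the cluster
processors = 4 * 2 + 5 * 1    # four dual-processor boxes plus five single-processor boxes

cluster_rate = docs_per_second(docs_per_day)
per_machine_rate = docs_per_second(docs_per_day, machines)
per_processor_rate = docs_per_second(docs_per_day, processors)

print(f"cluster:       {cluster_rate:.2f} docs/sec")
print(f"per machine:   {per_machine_rate:.2f} docs/sec")
print(f"per processor: {per_processor_rate:.2f} docs/sec")
```

That works out to roughly 6.7 documents per second for the whole cluster, or under one document per second per machine, covering fetching, parsing, and indexing combined.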

 

Aside from tuning properties to avoid memory problems and timeouts, the only
modification we have made to the Nutch code is to store the full text of the
documents we are crawling directly in the index (we need this for our
application). Otherwise, it is Nutch right out of the box.

Any guidance is greatly appreciated!


Thanks,

Dave
