Hi - We are running Nutch 0.9 on a 9-machine cluster, doing a general web crawl for text documents (HTML, PDF, DOC, TXT, etc.). We are getting about 580k documents fully indexed every 24 hours. Is that an expected level of throughput, or should it be higher? It seems low to me.
Four of the machines are dual-processor boxes with 2 GHz Intel Xeons, 4 GB of RAM, and a 226 GB hard disk; the remaining five are single-processor boxes with a 2.4 GHz Intel Celeron, 2 GB of RAM, and a 226 GB hard disk. Aside from tuning configuration properties to avoid memory problems and timeouts, the only modification we have made to the Nutch code is to store the full text of the crawled documents directly in the index (we need this for our application); otherwise it is Nutch right out of the box. Any guidance is greatly appreciated!

Thanks,
Dave
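
P.S. In case the indexing change matters for throughput, here is roughly what it looks like: an indexing filter that adds the parse text to the Lucene document as a stored field. This is a simplified sketch, not our actual patch; the class name and the "fulltext" field name are made up, and the exact 0.9 IndexingFilter signature and imports below are from memory, so they may differ slightly from stock Nutch.

// Simplified sketch only: an IndexingFilter that stores the full parse
// text in the index.  Interface and imports are approximate Nutch 0.9.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class StoreFullTextFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // Stock Nutch indexes the content but does not store it; we add a
    // stored, tokenized copy so our application can read the text back
    // out of the index.  (Storing full text makes the index much larger,
    // which may itself be part of the slowdown.)
    doc.add(new Field("fulltext", parse.getText(),
                      Field.Store.YES, Field.Index.TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

The plugin also needs the usual plugin.xml/build.xml wiring and an entry in plugin.includes in nutch-site.xml, which I've left out here.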
