Hi all,

I want to use Nutch in a distributed fashion for a total index size of ~1B pages, and I have a few related questions.
1. Is there something like an optimal segment size (I am referring to the subdirectories inside "segments") from which Nutch obtains snippets?
   (a) Is there a downside to having an arbitrarily large number of segment subdirectories (say 100K)? If the only downside is the "Too many open files" error, can I somehow get around that?
   (b) Is there a downside to having a very large segment subdirectory? I'm guessing that might make the cost of retrieving a snippet rather high if Nutch has to read a huge file that won't fit in a single disk block.

2. For an index of this size, what would be the optimal number of pages to host on each machine in an Amazon EC2 cluster, assuming I want to keep the overall response time under 3 seconds? I imagine that, given current hardware performance and costs, the answer may differ from what it was a couple of years ago.

3. It appears that when Nutch is used in a distributed fashion with the index partitioned by document, result retrieval is somewhat inefficient, since snippets are fetched for the search results from every machine in the cluster rather than for the top 10 alone. Is this a concern? In other words, is snippet retrieval generally more expensive than getting the results from the index?

4. Is there anything else I can do to keep response times low, for example by regulating the sizes of the segment subdirectories or by tuning parameters in nutch-default.xml and nutch-site.xml? (I have sketched the kind of overrides I have in mind below my signature.) Most of the advice I find involves keeping the entire index in memory. Is there anything reasonable I can do if I only have enough main memory to cache about 10% of my index on each machine in the cluster?

Thanks a ton,
Vijay
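
P.S. To make question 4 concrete, here is the kind of nutch-site.xml override I have been considering. The property names (searcher.max.hits, io.file.buffer.size) are the ones I found in nutch-default.xml and the Hadoop defaults; the values are only my guesses, not recommendations, so please correct me if these are the wrong knobs:

    <?xml version="1.0"?>
    <configuration>
      <!-- If positive, each index server stops after this many hits per query
           instead of scoring the full posting lists; my understanding is this
           trades a little result quality for lower latency. -->
      <property>
        <name>searcher.max.hits</name>
        <value>1000</value>
      </property>
      <!-- Hadoop buffer size for sequence-file I/O; I am assuming a larger
           buffer also helps when reading segment data to build snippets. -->
      <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
      </property>
    </configuration>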