Hi all,

I want to use Nutch in a distributed fashion for a total index size of ~1B pages, and I have a few related questions.
1. Is there something like an optimal segment size (I am referring to the subdirectories inside "segments") from which Nutch obtains snippets?
   (a) Is there a downside to having an arbitrarily large number of segment subdirectories (say 100K)? If the only downside is the "Too many open files" error, can I somehow get around that?
   (b) Is there a downside to having a very large segment subdirectory? I'm guessing that might make the cost of retrieving a snippet rather high if Nutch has to read a huge file that won't fit in a single disk block.

2. For an index of this size, what would be the optimal number of pages to host on each machine in an Amazon EC2 cluster, assuming I want to keep the overall response time under 3 seconds? I imagine that, given current hardware performance and costs, the answer may differ from what it was a couple of years ago.

3. It appears that when Nutch is used in a distributed fashion with the index partitioned by document, result retrieval is somewhat inefficient, since snippets are fetched for the search results from every machine in the cluster rather than for the top 10 alone. Is this a concern? In other words, is snippet retrieval generally more expensive than getting the results from the index?

4. Is there anything else I can do to keep response times low, for example by regulating the sizes of the segment subdirectories or by tuning parameters in nutch-default.xml and nutch-site.xml? (I have sketched the kind of overrides I have in mind below my signature.) Most of the advice I find involves keeping the entire index in memory. Is there anything reasonable I can do if I only have enough main memory to cache about 10% of my index on each machine in the cluster?

Thanks a ton,
Vijay
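
P.S. To make question 4 concrete, here is the kind of nutch-site.xml override I have been considering. The property names (searcher.max.hits, io.file.buffer.size) are the ones I found in nutch-default.xml and the Hadoop defaults; the values are only my guesses, not recommendations, so please correct me if these are the wrong knobs:

    <?xml version="1.0"?>
    <configuration>
      <!-- If positive, each index server stops after this many hits per query
           instead of scoring the full posting lists; my understanding is this
           trades a little result quality for lower latency. -->
      <property>
        <name>searcher.max.hits</name>
        <value>1000</value>
      </property>
      <!-- Hadoop buffer size for sequence-file I/O; I am assuming a larger
           buffer also helps when reading segment data to build snippets. -->
      <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
      </property>
    </configuration>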