Stefan Groschupf wrote:
Hi,

We ran into a problem in Nutch with MapFileOutputFormat#getReaders and getEntry. Specifically, it happens during summary generation, where for each segment we open as many readers as there are parts (part-0000 to part-n).
Having 80 tasktrackers and 80 segments means:
80 x 80 x 4 (parseData, parseText, content, crawl) = 25,600 readers. A search server also needs to keep open as many files as the index searcher requires.
So the problem shows up as a FileNotFoundException with the message "(Too many open files)".

Opening and closing readers for each detail makes no sense. We could limit the number of open readers somehow and close the ones that have gone unused the longest. But I'm not really happy with this solution, so any thoughts on how we can solve this problem in general?
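The "close the least-recently-used reader" idea above could be sketched roughly like the following. This is a minimal, hypothetical sketch: ReaderCache, ReaderFactory, and the capacity handling are invented names, and a plain java.io.Closeable stands in for MapFile.Reader so the example is self-contained.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: cap the number of simultaneously open readers
// and close the least-recently-used one when the cap is exceeded.
// R stands in for MapFile.Reader; the factory opens a reader on demand.
class ReaderCache<K, R extends Closeable> {

    interface ReaderFactory<K, R> {
        R open(K key) throws IOException;
    }

    private final int capacity;
    private final ReaderFactory<K, R> factory;
    private final LinkedHashMap<K, R> cache;

    ReaderCache(int capacity, ReaderFactory<K, R> factory) {
        this.capacity = capacity;
        this.factory = factory;
        // accessOrder = true makes iteration order least-recently-used first
        this.cache = new LinkedHashMap<K, R>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, R> eldest) {
                if (size() > ReaderCache.this.capacity) {
                    try {
                        // close the reader that was unused the longest
                        eldest.getValue().close();
                    } catch (IOException ignored) {
                        // best effort on eviction
                    }
                    return true;
                }
                return false;
            }
        };
    }

    // Return a cached reader, opening (and possibly evicting) as needed.
    synchronized R get(K key) throws IOException {
        R reader = cache.get(key);
        if (reader == null) {
            reader = factory.open(key);
            cache.put(key, reader);
        }
        return reader;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

With, say, a capacity of a few hundred per JVM, the total file-descriptor count stays bounded regardless of how many segments x parts exist, at the cost of re-opening a reader on a cache miss.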

I don't think we can reduce the number of open files in this case. The solutions that come to mind are:

* merge the 80 segments into 1. A lot of IO involved... and you have to repeat it from time to time. Ugly.

* implement a search server as a map task. Several challenges: it needs to partition the Lucene index, and it has to copy all parts of segments and indexes from DFS to local storage, otherwise performance will suffer. However, the number of open files per machine would be reduced, because (ideally) each machine would deal with just one (or a few) parts of a segment and a single part of the index...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
