Stefan Groschupf wrote:
Hi,
We ran into a problem in Nutch with MapFileOutputFormat#getReaders
and getEntry.
In detail, this happens during summary generation, where for each
segment we open as many readers as there are parts (part-0000 to part-n).
Having 80 tasktrackers and 80 segments means 80 x 80 x 4 = 25,600
readers (parseData, parseText, content, crawl). A search server
also needs to open as many files as the index searcher requires.
So the problem is a FileNotFoundException ("Too many open files").
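For context, a minimal sketch of the access pattern in question, using the old org.apache.hadoop.mapred.MapFileOutputFormat helpers; the "parse_text" subdirectory, the Text key, the HashPartitioner and the wrapper class are illustrative assumptions, and the exact signatures may differ between Hadoop versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    public class SummaryLookupSketch {
      public static Writable lookup(FileSystem fs, Path segmentDir, Text url,
                                    Writable value, Configuration conf) throws Exception {
        // getReaders opens one MapFile.Reader per part-NNNN directory, and each
        // reader keeps two files (index + data) open. With 80 parts per segment,
        // 80 segments and 4 data directories, the descriptors add up quickly.
        MapFile.Reader[] readers =
            MapFileOutputFormat.getReaders(fs, new Path(segmentDir, "parse_text"), conf);
        // Look up the entry in the part that the partitioner maps the key to.
        // In practice the readers are kept open for the next lookup rather than
        // closed here, which is exactly what exhausts the file descriptors.
        return MapFileOutputFormat.getEntry(readers, new HashPartitioner(), url, value);
      }
    }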
Opening and closing readers for each detail makes no sense. We could
limit the number of open readers somehow and close the ones that have
gone unused the longest (an LRU scheme; a rough sketch follows below).
But I'm not that happy with this solution, so any thoughts on how we can
solve this problem in general?
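A minimal sketch of that LRU idea, using a plain LinkedHashMap in access order so the least-recently-used reader is closed when the cap is exceeded; the cache size and the string key (e.g. the part path) are illustrative assumptions, not part of the original proposal:

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.hadoop.io.MapFile;

    /** Keeps at most maxOpen MapFile readers open, evicting and closing the LRU one. */
    public class ReaderCache extends LinkedHashMap<String, MapFile.Reader> {
      private final int maxOpen;

      public ReaderCache(int maxOpen) {
        super(16, 0.75f, true);          // accessOrder=true gives LRU iteration order
        this.maxOpen = maxOpen;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<String, MapFile.Reader> eldest) {
        if (size() > maxOpen) {
          try {
            eldest.getValue().close();   // release the evicted reader's file descriptors
          } catch (IOException e) {
            // log and continue; eviction should not fail the lookup
          }
          return true;
        }
        return false;
      }
    }

A lookup would then call cache.get(partPath) and only open and insert a new reader on a miss; evicted readers are closed automatically by removeEldestEntry.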
I don't think we can reduce the number of open files in this case... The
solutions that come to my mind are:
* merge 80 segments into 1. A lot of IO involved... and you have to
repeat it from time to time. Ugly.
* implement a search server as a map task. Several challenges: it needs
to partition the Lucene index, and it has to copy all parts of segments
and indexes from DFS to local storage, otherwise performance will
suffer. However, the number of open files per machine would be reduced,
because (ideally) each machine would deal with only a few parts (or a
single part) of a segment and a single part of the index...
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com