Stefan Groschupf wrote:
Hi,
We ran into a problem in Nutch with MapFileOutputFormat#getReaders
and getEntry.
In detail, this happens during summary generation, where for each
segment we open as many readers as there are parts (part-0000 to part-n).
Having 80 tasktrackers and 80 segments means 80 x 80 x 4 = 25,600
readers (parseData, parseText, content, crawl). A search server
also needs to open as many files as the index searcher requires.
So the problem is a FileNotFoundException ("Too many open files").
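For context, a minimal sketch of the access pattern in question, using the old org.apache.hadoop.mapred.MapFileOutputFormat helpers; the "parse_text" subdirectory, the Text key, the HashPartitioner and the wrapper class are illustrative assumptions, and the exact signatures may differ between Hadoop versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    public class SummaryLookupSketch {
      public static Writable lookup(FileSystem fs, Path segmentDir, Text url,
                                    Writable value, Configuration conf) throws Exception {
        // getReaders opens one MapFile.Reader per part-NNNN directory, and each
        // reader keeps two files (index + data) open. With 80 parts per segment,
        // 80 segments and 4 data directories, the descriptors add up quickly.
        MapFile.Reader[] readers =
            MapFileOutputFormat.getReaders(fs, new Path(segmentDir, "parse_text"), conf);
        // Look up the entry in the part that the partitioner maps the key to.
        // In practice the readers are kept open for the next lookup rather than
        // closed here, which is exactly what exhausts the file descriptors.
        return MapFileOutputFormat.getEntry(readers, new HashPartitioner(), url, value);
      }
    }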
Opening and closing readers for each detail makes no sense. We could
limit the number of open readers somehow and close the ones that have
gone unused the longest (an LRU scheme; a rough sketch follows below).
But I'm not that happy with this solution, so any thoughts on how we can
solve this problem in general?
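A minimal sketch of that LRU idea, using a plain LinkedHashMap in access order so the least-recently-used reader is closed when the cap is exceeded; the cache size and the string key (e.g. the part path) are illustrative assumptions, not part of the original proposal:

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.hadoop.io.MapFile;

    /** Keeps at most maxOpen MapFile readers open, evicting and closing the LRU one. */
    public class ReaderCache extends LinkedHashMap<String, MapFile.Reader> {
      private final int maxOpen;

      public ReaderCache(int maxOpen) {
        super(16, 0.75f, true);          // accessOrder=true gives LRU iteration order
        this.maxOpen = maxOpen;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<String, MapFile.Reader> eldest) {
        if (size() > maxOpen) {
          try {
            eldest.getValue().close();   // release the evicted reader's file descriptors
          } catch (IOException e) {
            // log and continue; eviction should not fail the lookup
          }
          return true;
        }
        return false;
      }
    }

A lookup would then call cache.get(partPath) and only open and insert a new reader on a miss; evicted readers are closed automatically by removeEldestEntry.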
I don't think we can reduce the number of open files in this case... The
solutions that come to my mind are:
* merge 80 segments into 1. A lot of IO involved... and you have to
repeat it from time to time. Ugly.
* implement a search server as a map task. Several challenges: it needs
to partition the Lucene index, and it has to copy all parts of segments
and indexes from DFS to local storage, otherwise performance will
suffer. However, the number of open files per machine would be reduced,
because (ideally) each machine would deal with only a few parts (or a
single part) of a segment and a single part of the index...
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com