Re: scalability limits getDetails, mapFile Readers?

Stefan Groschupf Wed, 01 Mar 2006 17:57:48 -0800


On Mar 2, 2006, at 2:06 AM, Ken Krugler wrote:

* merge 80 segments into 1. A lot of IO involved... and you have to repeat it from time to time. Ugly.
I agree.
* implement a search server as a map task. Several challenges: it needs to partition the Lucene index, and it has to copy all parts of segments and indexes from DFS to the local storage, otherwise performance will suffer. However, the number of open files per machine would be reduced, because (ideally) each machine would deal with few or a single part of segment and a single part of index...
Well, I played around and already had a kind of prototype.
I saw the following problems:

+ having a kind of repository of active search servers
possibility A: find all tasktrackers running a specific task (already discussed on the hadoop mailing list)
possibility B: have an RPC server running in the JVM that runs the search server client, add the hostname to the jobconf and, similar to task - jobtracker, have the search server announce itself via heartbeat to the search server 'repository'.
+ having the index locally and the segment in the dfs.
++ adding to NutchBean init a dfs for index and one for segments could fix this, or more generally add support for stream handlers like dfs:// vs file://. (very long term)
This seems most interesting to me, as working around issues of a distributed index (while still getting reasonable performance) seems tricky.
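
To make possibility B above a bit more concrete, here is a minimal sketch of a heartbeat-based 'repository' of search servers. It is only an illustration: the class SearchServerRegistry, its methods, and the timeout rule are assumptions made up for this example and are not part of Nutch or Hadoop. A real version would receive the heartbeats over RPC.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory registry of search servers; not a Nutch or Hadoop class. */
class SearchServerRegistry {
    private final long timeoutMillis;
    // host:port -> timestamp of the last heartbeat received from that server
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    SearchServerRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called on behalf of a search server to announce that it is alive. */
    void heartbeat(String hostAndPort, long nowMillis) {
        lastSeen.put(hostAndPort, nowMillis);
    }

    /** Returns, in sorted order, the servers whose last heartbeat is within the timeout. */
    List<String> liveServers(long nowMillis) {
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() <= timeoutMillis) {
                live.add(e.getKey());
            }
        }
        Collections.sort(live);
        return live;
    }
}
```

A search front end would ask liveServers(...) for the current list before sending a query, so servers that stop sending heartbeats drop out on their own.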

Yes, searching from NDFS is too slow, since all the IO stream calls need to go over the network.

But being able to build a local index from NDFS, and access the segment data via NDFS when needed for summaries & cached files would seem fairly straightforward.
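
The split discussed here (index on local disk, segments in NDFS) together with the dfs:// vs file:// idea could be sketched roughly as follows. This is an assumption-laden illustration, not Nutch code: StorageResolver and Backend are hypothetical names, and the only real API used is java.net.URI for parsing the scheme.

```java
import java.net.URI;

/** Hypothetical helper that picks a storage backend from a location string. */
class StorageResolver {
    enum Backend { LOCAL, DFS }

    /**
     * Maps "file://..." or a plain path to LOCAL and "dfs://..." to DFS.
     * Any other scheme is rejected.
     */
    static Backend backendFor(String location) {
        String scheme = URI.create(location).getScheme();
        if (scheme == null || scheme.equals("file")) {
            return Backend.LOCAL;
        }
        if (scheme.equals("dfs")) {
            return Backend.DFS;
        }
        throw new IllegalArgumentException("Unsupported scheme: " + scheme);
    }
}
```

With something like this, the index location and the segments location could be configured independently, for example a file:// path for the index and a dfs:// path for the segments.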

I tried it and the concept generally works; it is more a problem of getting things managed. For example, you cannot control which tasktracker gets the search server job. Partitions, or maybe better call them tasktracker groups, a la "allBoxesWithMuchMemory" or "boxesWithMuchCPU", are not supported yet. Also, the boot-up time of such a map runnable search server is very long, since it first needs to copy or index data from the NDFS segments.
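
As noted, tasktracker groups are not supported yet, so the following is purely a sketch of what such a grouping might look like. TrackerGroups and its methods are hypothetical names invented for this example; nothing like this exists in Hadoop at the time of writing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical mapping from a group name to the tasktracker hosts in that group. */
class TrackerGroups {
    private final Map<String, Set<String>> groups = new HashMap<>();

    /** Adds a host to a named group, creating the group if needed. */
    void add(String group, String host) {
        groups.computeIfAbsent(group, g -> new TreeSet<>()).add(host);
    }

    /** Returns the hosts of a group in sorted order, or an empty list if the group is unknown. */
    List<String> hostsIn(String group) {
        Set<String> hosts = groups.get(group);
        if (hosts == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hosts);
    }
}
```

A job could then name a group such as "allBoxesWithMuchMemory", and a scheduler would only hand the search server task to hosts returned by hostsIn(...).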

Though every time I think I understand Nutch I'm wrong - or the code changes :)
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com
