Hi Andrzej,
* merge 80 segments into 1. A lot of IO involved... and you have to
repeat it from time to time. Ugly.
I agree.
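To make the cost concrete, the index half of such a merge boils down
to something like the following (just a rough sketch; the target path
is invented and it assumes all part indexes were already copied to
local disk):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeParts {
    public static void main(String[] args) throws Exception {
      // args: the local paths of the 80 part indexes
      Directory[] parts = new Directory[args.length];
      for (int i = 0; i < args.length; i++) {
        parts[i] = FSDirectory.getDirectory(args[i], false);
      }
      // 'true' creates a fresh target index
      IndexWriter writer = new IndexWriter("/local/merged-index",
          new StandardAnalyzer(), true);
      // reads and rewrites every part: this is where all the IO goes,
      // and it has to be redone after every crawl cycle
      writer.addIndexes(parts);
      writer.optimize();
      writer.close();
    }
  }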
* implement a search server as a map task. Several challenges: it
needs to partition the Lucene index, and it has to copy all parts
of segments and indexes from DFS to the local storage, otherwise
performance will suffer. However, the number of open files per
machine would be reduced, because (ideally) each machine would deal
with only a few, or even a single, segment part and a single index
part...
Well, I played around and already have a kind of prototype.
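Roughly it works like this: a map task that never returns and keeps a
search server alive over its local part of the index. SearchServer
below is only a stand-in for the real serving code, and the property
name is invented:

  import java.io.IOException;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapRunnable;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  public class SearchServerRunner implements MapRunnable {

    // stand-in for the real query-serving code
    static class SearchServer {
      private final String indexDir;
      SearchServer(String indexDir) { this.indexDir = indexDir; }
      void start() { /* open the index, bind a port, serve queries */ }
    }

    private JobConf job;

    public void configure(JobConf job) {
      this.job = job;
    }

    public void run(RecordReader input, OutputCollector output,
        Reporter reporter) throws IOException {
      SearchServer server = new SearchServer(job.get("search.index.dir"));
      server.start();
      // never return: the task *is* the server
      while (true) {
        // report status so the tasktracker does not kill the task
        reporter.setStatus("serving queries");
        try { Thread.sleep(10000); } catch (InterruptedException e) { return; }
      }
    }
  }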
I have seen the following problems:
+ having a kind of repository of active search servers
possibility A: find all tasktrackers running a specific task (already
discussed on the Hadoop mailing list).
possibility B: have an rpc server running in the jvm that runs the
search server, add the hostname to the jobconf and, similar to the
task-to-jobtracker heartbeat, let the search server announce itself
via heartbeat to the search server 'repository' (see the first
sketch below).
+ having the index locally and the segments in the dfs.
++ adding to the NutchBean init one dfs for the index and one for the
segments could fix this (see the second sketch below), or, more
generally, adding support for stream handlers like dfs:// vs. file://
(very long term).
+ downloading the index from the dfs before the mapper starts (see
the last sketch below), or just indexing the segment data to the
local hdd and letting the mapper run for the next 30 days?
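For possibility B, the announcing side could be as simple as the
following sketch (SearchServerRepository and all names are invented;
in practice the repository reference would be an rpc proxy):

  // invented protocol for the search server 'repository'
  interface SearchServerRepository {
    void heartbeat(String host, int port, String indexPart);
  }

  // announces this search server in regular intervals, analogous to
  // the task-to-jobtracker heartbeat
  class Announcer extends Thread {
    private final SearchServerRepository repository; // e.g. an rpc proxy
    private final String host;
    private final int port;
    private final String part;

    Announcer(SearchServerRepository repository, String host, int port,
        String part) {
      this.repository = repository;
      this.host = host;
      this.port = port;
      this.part = part;
      setDaemon(true); // die together with the search server jvm
    }

    public void run() {
      while (true) {
        // "I'm alive and serving this part of the index"
        repository.heartbeat(host, port, part);
        try { Thread.sleep(5000); } catch (InterruptedException e) { return; }
      }
    }
  }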
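The two-filesystem idea would look roughly like this (paths are
invented; NutchBean would get both filesystems handed in instead of
assuming a single one):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class TwoFileSystems {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // the index lives on local disk for fast random access ...
      FileSystem indexFs = FileSystem.getLocal(conf);
      Path index = new Path("/local/search/index");
      // ... while the segments stay in the dfs, streamed on demand
      FileSystem segmentFs = FileSystem.get(conf);
      Path segments = new Path("/user/nutch/segments");
      System.out.println("index: " + indexFs.exists(index)
          + ", segments: " + segmentFs.exists(segments));
    }
  }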
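And the download variant: pull the index out of the dfs once in
configure(), before map() ever runs (property name and local path are
invented):

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  public class IndexFetcher extends MapReduceBase {
    public void configure(JobConf job) {
      try {
        FileSystem dfs = FileSystem.get(job);
        Path remote = new Path(job.get("search.index.dfs.dir"));
        Path local = new Path("/tmp/search/index");
        // one big copy up front; afterwards the index is read from
        // local disk for however long the task keeps running
        dfs.copyToLocalFile(remote, local);
      } catch (IOException e) {
        throw new RuntimeException("could not fetch the index", e);
      }
    }
  }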
Stefan