You don't mention what size cluster you have, but we use a relatively small
cluster and index hundreds of GB in one to a few hours (depending on the
content and the size of the cluster). So your results are anomalous.
However, we wrote our own indexer. The way it works is that documents are
given randomized shard numbers and all of the documents for a single shard
are indexed by a reduce function that produces a Lucene index on local disk
and copies it to the central system. This is a very common idiom and I am
sure that we essentially copied sample code from somewhere (likely from the
Katta distribution). We use a relatively large number of shards and don't
try to merge the shard indexes after indexing.
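Here's a minimal sketch of that idiom, not our actual code -- the class
name, field name, config key, and the /tmp shard path are all made up,
and it assumes the old mapred API plus a Lucene 2.9/3.x-era IndexWriter.
The mapper side just emits (random.nextInt(numShards), document).

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ShardIndexReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, NullWritable, NullWritable> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void reduce(IntWritable shard, Iterator<Text> docs,
      OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
      throws IOException {
    // Build this shard's index on the task's local disk.
    File localDir = new File("/tmp/shard-" + shard.get());
    IndexWriter writer = new IndexWriter(FSDirectory.open(localDir),
        new StandardAnalyzer(Version.LUCENE_30), true,
        IndexWriter.MaxFieldLength.UNLIMITED);

    while (docs.hasNext()) {
      Document doc = new Document();
      doc.add(new Field("content", docs.next().toString(),
          Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
      reporter.progress();  // keep the task alive while indexing
    }
    writer.close();

    // Copy the finished shard index up to the shared filesystem.
    // "index.output.dir" is a made-up config key for this sketch.
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(localDir.getAbsolutePath()),
        new Path(conf.get("index.output.dir"), "shard-" + shard.get()));
  }
}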
FWIW, at Krugle a major performance bottleneck with index creation
was the merge time. We had expected newer versions of Lucene to
dramatically reduce the time required to do merges, but that didn't
happen, even with a few stabs at tuning various parameters.
So if you can avoid merging entirely, by using a reasonable number of
shards, you can significantly reduce the total time required to build
the index.
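Concretely, that mostly means never calling optimize() on the per-shard
writers, and keeping the background merges cheap via the writer settings.
A hedged sketch in Lucene 2.9/3.x API terms -- the numbers are placeholders
to tune for your own hardware, not anything we've benchmarked:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NoMergeWriter {
  // Open a per-shard writer tuned to minimize merge cost.
  public static IndexWriter openShardWriter(File shardDir) throws IOException {
    IndexWriter writer = new IndexWriter(FSDirectory.open(shardDir),
        new StandardAnalyzer(Version.LUCENE_30), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setMergeFactor(30);         // fewer, later merges than the default 10
    writer.setRAMBufferSizeMB(128.0);  // flush segments by RAM use, not doc count
    return writer;                     // caller adds docs and close()s it --
                                       // the point is to never call optimize()
  }
}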
And as for the actual index generation, with the Bixo project we're
using some Katta-derived code...it's a Cascading Scheme (called
IndexScheme) that generates an index. Pretty straightforward, other
than needing to call Hadoop's reporter via a thread so the task
doesn't time out during a long Lucene optimize.
We wind up with one index (shard) per reducer, so by controlling the
number of reducers we can vary the shard count, down to a minimum
count == the number of slaves in the processing cluster.
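For reference, the keep-alive trick is roughly this -- a sketch, not the
actual IndexScheme code; the 10-second interval is just an assumption
that needs to stay well under mapred.task.timeout:

import java.io.IOException;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.index.IndexWriter;

public class KeepAliveOptimize {
  // Ping Hadoop's reporter from a daemon thread so the task isn't
  // killed for inactivity while a long Lucene optimize runs.
  public static void optimizeWithHeartbeat(final IndexWriter writer,
      final Reporter reporter) throws IOException {
    Thread heartbeat = new Thread(new Runnable() {
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          reporter.progress();         // tell the TaskTracker we're alive
          try {
            Thread.sleep(10 * 1000L);  // well inside mapred.task.timeout
          } catch (InterruptedException e) {
            return;
          }
        }
      }
    });
    heartbeat.setDaemon(true);
    heartbeat.start();
    try {
      writer.optimize();               // the long-running merge
    } finally {
      heartbeat.interrupt();
    }
  }
}

And varying the shard count is then just a matter of calling
setNumReduceTasks() on the JobConf.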
-- Ken
I haven't looked at the index contrib code lately so I can't comment
specifically on that. It is quite possible that it isn't widely used.
My guess is that you have something very simple going awry.
Can you say more about your cluster size, type of machine, operating system,
what your average document size is, whether you are trying to merge indexes,
and whether you are using the RAMDirectory or the FSDirectory?
On Thu, Jul 9, 2009 at 8:26 AM, bhushan_mahale <[email protected]> wrote:
I am trying to create Lucene indexes using the
"contrib/index/hadoop-0.19.1-index.jar" provided by Hadoop.
Since it can be executed in a map-reduce manner, I expect it to process
large data very fast.
It processes small amounts of data (< 5 MB) very quickly.
Now 5 GB of input data is provided, and the fun starts :)
It runs out of memory.