You don't mention what size cluster you have, but we use a relatively small
cluster and index hundreds of GB in one to a few hours (depending on the
content and the size of the cluster). So your results are anomalous.
However, we wrote our own indexer. The way it works is that documents are
given randomized shard numbers and all of the documents for a single shard
are indexed by a reduce function that produces a Lucene index on local disk
and copies it to the central system. This is a very common idiom and I am
sure that we essentially copied sample code from somewhere (likely from the
Katta distribution). We use a relatively large number of shards and don't
try to merge the shard indexes after indexing.
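Here's a minimal sketch of that idiom, not our actual code -- the class
name, field name, config key, and the /tmp shard path are all made up,
and it assumes the old mapred API plus a Lucene 2.9/3.x-era IndexWriter.
The mapper side just emits (random.nextInt(numShards), document).

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ShardIndexReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, NullWritable, NullWritable> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void reduce(IntWritable shard, Iterator<Text> docs,
      OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
      throws IOException {
    // Build this shard's index on the task's local disk.
    File localDir = new File("/tmp/shard-" + shard.get());
    IndexWriter writer = new IndexWriter(FSDirectory.open(localDir),
        new StandardAnalyzer(Version.LUCENE_30), true,
        IndexWriter.MaxFieldLength.UNLIMITED);

    while (docs.hasNext()) {
      Document doc = new Document();
      doc.add(new Field("content", docs.next().toString(),
          Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
      reporter.progress();  // keep the task alive while indexing
    }
    writer.close();

    // Copy the finished shard index up to the shared filesystem.
    // "index.output.dir" is a made-up config key for this sketch.
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(localDir.getAbsolutePath()),
        new Path(conf.get("index.output.dir"), "shard-" + shard.get()));
  }
}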
FWIW, at Krugle a major performance bottleneck with index creation
was the merge time. We had expected newer versions of Lucene to
dramatically reduce the time required to do merges, but that didn't
happen, even with a few stabs at tuning various parameters.
So if you can avoid merging entirely, by using a reasonable number of
shards, you can significantly reduce the total time required to build
the index.
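Concretely, that mostly means never calling optimize() on the per-shard
writers, and keeping the background merges cheap via the writer settings.
A hedged sketch in Lucene 2.9/3.x API terms -- the numbers are placeholders
to tune for your own hardware, not anything we've benchmarked:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NoMergeWriter {
  // Open a per-shard writer tuned to minimize merge cost.
  public static IndexWriter openShardWriter(File shardDir) throws IOException {
    IndexWriter writer = new IndexWriter(FSDirectory.open(shardDir),
        new StandardAnalyzer(Version.LUCENE_30), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setMergeFactor(30);         // fewer, later merges than the default 10
    writer.setRAMBufferSizeMB(128.0);  // flush segments by RAM use, not doc count
    return writer;                     // caller adds docs and close()s it --
                                       // the point is to never call optimize()
  }
}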
And as for the actual index generation, with the Bixo project we're
using some Katta-derived code...it's a Cascading Scheme (called
IndexScheme) that generates an index. Pretty straightforward, other
than needing to call Hadoop's reporter via a thread so the task
doesn't time out during a long Lucene optimize.
We wind up with one index (shard) per reducer, so by controlling the
number of reducers we can vary the shard count, down to a minimum
count == the number of slaves in the processing cluster.
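For reference, the keep-alive trick is roughly this -- a sketch, not the
actual IndexScheme code; the 10-second interval is just an assumption
that needs to stay well under mapred.task.timeout:

import java.io.IOException;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.index.IndexWriter;

public class KeepAliveOptimize {
  // Ping Hadoop's reporter from a daemon thread so the task isn't
  // killed for inactivity while a long Lucene optimize runs.
  public static void optimizeWithHeartbeat(final IndexWriter writer,
      final Reporter reporter) throws IOException {
    Thread heartbeat = new Thread(new Runnable() {
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          reporter.progress();         // tell the TaskTracker we're alive
          try {
            Thread.sleep(10 * 1000L);  // well inside mapred.task.timeout
          } catch (InterruptedException e) {
            return;
          }
        }
      }
    });
    heartbeat.setDaemon(true);
    heartbeat.start();
    try {
      writer.optimize();               // the long-running merge
    } finally {
      heartbeat.interrupt();
    }
  }
}

And varying the shard count is then just a matter of calling
setNumReduceTasks() on the JobConf.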
-- Ken
I haven't looked at the index contrib code lately so I can't comment
specifically on that. It is quite possible that it isn't widely used.
My guess is that you have something very simple going awry.
Can you say more about your cluster size, type of machine, operating system,
what your average document size is, whether you are trying to merge indexes,
and whether you are using the RAMDirectory or the FSDirectory?
On Thu, Jul 9, 2009 at 8:26 AM, bhushan_mahale <[email protected]> wrote:
I am trying to create Lucene indexes using the
"contrib/index/hadoop-0.19.1-index.jar" provided by Hadoop.
Since it can be executed in a map-reduce manner, I expect it to process
large data very fast.
It processes small amounts of data (< 5 MB) very quickly.
Now 5 GB of input data is provided, and the fun starts :)
It runs out of memory.