You don't mention what size cluster you have, but we use a relatively small
cluster and index hundreds of GB in an hour to a few hours (depending on the
content and the size of the cluster).  So your results are anomalous.

However, we wrote our own indexer.  The way it works is that documents are
given randomized shard numbers and all of the documents for a single shard
are indexed by a reduce function that produces a Lucene index on local disk
and copies it to the central system.  This is a very common idiom and I am
sure that we essentially copied sample code from somewhere (likely from the
Katta distribution).   We use a relatively large number of shards and don't
try to merge the shard indexes after indexing.
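Roughly, the reducer side of that idiom looks like the sketch below.  This is
not our actual code; it assumes the old Hadoop 0.19 mapred API and a Lucene
2.4-era IndexWriter, and the paths, field names, and class name are made up:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class ShardIndexReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, IntWritable, Text> {

  private JobConf conf;

  public void configure(JobConf conf) {
    this.conf = conf;
  }

  public void reduce(IntWritable shard, Iterator<Text> docs,
                     OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    // Build this shard's index on local disk (not a RAMDirectory) so that
    // heap use stays bounded no matter how big the shard is.
    String local = "/tmp/shard-" + shard.get();  // hypothetical scratch dir
    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(local),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    while (docs.hasNext()) {
      Document doc = new Document();
      doc.add(new Field("text", docs.next().toString(),
          Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
      reporter.progress();  // keep the task alive during long indexing
    }
    writer.optimize();  // optional: collapse the shard to one segment
    writer.close();

    // Copy the finished shard index to the central system (HDFS here).
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(local),
        new Path("/indexes/shard-" + shard.get()));
  }
}

The mapper just emits each document keyed by a random shard number, so the
shards come out roughly balanced without any coordination.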

I haven't looked at the index contrib code lately so I can't comment
specifically on that.  It is quite possible that it isn't widely used.

My guess is that you have something very simple going awry.

Can you say more about your cluster size, type of machine, operating system,
what your average document size is, whether you are trying to merge indexes,
and whether you are using the RAMDirectory or the FSDirectory?
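The Directory question matters because a RAMDirectory builds the entire index
in the task's heap, which by itself would explain an out-of-memory failure at
5 GB.  In Lucene 2.x terms the choice looks like this (path is made up):

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirectoryChoice {
  public static void main(String[] args) throws IOException {
    // RAMDirectory keeps every index segment in the JVM heap, so heap
    // use grows with total index size.
    Directory inMemory = new RAMDirectory();

    // FSDirectory writes segments to local disk; heap use is bounded by
    // the indexing buffer, not by the amount of data indexed.
    Directory onDisk = FSDirectory.getDirectory("/tmp/shard-index");
  }
}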

On Thu, Jul 9, 2009 at 8:26 AM, bhushan_mahale <
[email protected]> wrote:

> I am trying to create Lucene indexes using the
> "contrib/index/hadoop-0.19.1-index.jar" provided by Hadoop.
> Since it can be executed in a map-reduce manner, I expect it to process
> large data very fast.
> It processes a small amount of data (< 5MB) very quickly.
>
> Now 5 GB of input data is provided, and the fun starts :)
>
> It goes out of memory.
>
