Mike,
Actually my documents are very small. We have CSV files where each record
represents a document, and none of them is large, so I don't think document
size is the issue.

For each record I tokenize it, and for each token I keep its 3 neighbouring
tokens in a Hashtable (a simplified sketch of this step follows the code
below). After every X documents, where X is currently 2500, I build the index
with the following code:
// Initialization step, done only once at startup
cram = FSDirectory.open(new File("lucenetemp2"));
context_writer = new IndexWriter(cram, analyzer, true,
                                 IndexWriter.MaxFieldLength.LIMITED);

// Called after every 2500 docs
void update_context() throws IOException
{
    // commit and merge so the new searcher sees all documents added so far
    context_writer.commit();
    context_writer.optimize();
    IndexSearcher is = new IndexSearcher(cram);
    IndexReader ir = is.getIndexReader();
    Iterator<String> it = context.keySet().iterator();
    while (it.hasNext())
    {
        String word = it.next();
        // All the context collected for "word" over the current 2500 docs
        StringBuffer w_context = context.get(word);
        Term t = new Term("Word", word);
        TermQuery tq = new TermQuery(t);
        TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
        is.search(tq, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        if (hits.length != 0)
        {
            // The word is already indexed: rebuild its stored context from the
            // term frequency vector and append the new context to it.
            int id = hits[0].doc;
            TermFreqVector tfv = ir.getTermFreqVector(id, "Context");
            // getContextString expands the vector, e.g. word1(2), word2(1), word3(2)
            // becomes context_str = "word1 word1 word2 word3 word3"
            String context_str = getContextString(tfv);
            w_context.append(context_str);
            Document new_doc = new Document();
            new_doc.add(new Field("Word", word,
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
            new_doc.add(new Field("Context", w_context.toString(),
                                  Field.Store.YES, Field.Index.ANALYZED,
                                  Field.TermVector.YES));
            context_writer.updateDocument(t, new_doc);
        }
        else
        {
            // First time this word is seen: add a fresh document
            Document new_doc = new Document();
            new_doc.add(new Field("Word", word,
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
            new_doc.add(new Field("Context", w_context.toString(),
                                  Field.Store.YES, Field.Index.ANALYZED,
                                  Field.TermVector.YES));
            context_writer.addDocument(new_doc);
        }
    }
    ir.close();
    is.close();
}
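For completeness, the per-document accumulation step I described earlier (the
part that fills the context Hashtable read by update_context) looks roughly
like the sketch below; the whitespace tokenizer and the exact handling of the
3-token neighbourhood are simplified here, but the data structure is the same:

Map<String, StringBuffer> context = new Hashtable<String, StringBuffer>();

void accumulate(String record)
{
    // simple whitespace split -- the real code uses a proper tokenizer
    String[] tokens = record.trim().split("\\s+");
    for (int i = 0; i < tokens.length; i++)
    {
        StringBuffer sb = context.get(tokens[i]);
        if (sb == null)
        {
            sb = new StringBuffer();
            context.put(tokens[i], sb);
        }
        // append the neighbouring tokens (here: up to 3 on each side, for
        // illustration) as this token's context
        for (int j = Math.max(0, i - 3); j <= Math.min(tokens.length - 1, i + 3); j++)
        {
            if (j != i)
            {
                sb.append(tokens[j]).append(' ');
            }
        }
    }
}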
I am also printing memory usage after each invocation of this method, and I
observed that it grows with every call of update_context; around the 65-70k
document mark the process goes OutOfMemory, so memory is accumulating
somewhere across invocations. I expected each invocation to use a roughly
constant amount of memory rather than growing cumulatively. After each
invocation of update_context I also call System.gc() to release memory, and I
tried various writer parameters as well:

context_writer.setMaxBufferedDocs()
context_writer.setMaxMergeDocs()
context_writer.setRAMBufferSizeMB()

I set these parameters to smaller values, but nothing worked.
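Concretely, the tuning I tried looked roughly like this (the values shown are
only examples of the smaller settings I experimented with):

// illustrative values -- I tried several combinations of smaller settings
context_writer.setMaxBufferedDocs(100);
context_writer.setMaxMergeDocs(1000);
context_writer.setRAMBufferSizeMB(8.0);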
Any hint will be very helpful.
Thanks
Ajay
Michael McCandless wrote:
>
> The worst case RAM usage for Lucene is a single doc with many unique
> terms. Lucene allocates ~60 bytes per unique term (plus space to hold
> that term's characters = 2 bytes per char). And, Lucene cannot flush
> within one document -- it must flush after the doc has been fully
> indexed.
>
> This past thread (also from Paul) delves into some of the details:
>
> http://lucene.markmail.org/thread/pbeidtepentm6mdn
>
> But it's not clear whether that is the issue affecting Ajay -- I think
> more details about the docs, or, some code fragments, could help shed
> light.
>
> Mike
>
> On Tue, Mar 2, 2010 at 8:47 AM, Murdoch, Paul <[email protected]> wrote:
>> Ajay,
>>
>> Here is another thread I started on the same issue.
>>
>> http://stackoverflow.com/questions/1362460/why-does-lucene-cause-oom-when-indexing-large-files
>>
>> Paul
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]
>> ] On Behalf Of ajay_gupta
>> Sent: Tuesday, March 02, 2010 8:28 AM
>> To: [email protected]
>> Subject: Lucene Indexing out of memory
>>
>>
>> Hi,
>> It might be a general question, but I couldn't find the answer yet. I
>> have around 90k documents totalling around 350 MB. Each document contains
>> a record with some text content. For each word in this text I want to
>> store that word's context and index it, so I read each document and, for
>> each word in it, I append a fixed number of surrounding words. To do that
>> I first search the existing index to check whether the word already
>> exists; if it does, I get the content, append the new context and update
>> the document. If no context exists yet, I create a document with the
>> fields "word" and "context" and add those two fields with the word and
>> its context as values.
>>
>> I tried this in RAM, but after a certain number of docs it gave an out of
>> memory error, so I switched to the FSDirectory approach. Surprisingly,
>> after 70k documents it also gave an OOM error. I have enough disk space,
>> yet I still get this error, and I am not sure why disk-based indexing
>> fails this way. I expected disk-based indexing to be slower but at least
>> scalable.
>> Could someone suggest what the issue could be?
>>
>> Thanks
>> Ajay