I agree, a memory profiler, a heap dump or a small test case is the next step... the code looks fine.
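(For reference, one way to capture such a heap dump automatically is the HotSpot JVM's built-in flags; the heap size, dump path and class name below are only placeholders:)

    java -Xmx1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/indexer.hprof MyIndexer

The resulting .hprof file can then be opened in a profiler such as Eclipse MAT or jhat.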
This is always a single thread adding docs?  Are you really certain that the iterator only iterates over 2500 docs?  What analyzer are you using?

Mike

On Thu, Mar 4, 2010 at 4:50 AM, Ian Lea <ian....@gmail.com> wrote:
> Have you run it through a memory profiler yet?  Seems the obvious next step.
>
> If that doesn't help, cut it down to the simplest possible
> self-contained program that demonstrates the problem and post it here.
>
> --
> Ian.
>
> On Thu, Mar 4, 2010 at 6:04 AM, ajay_gupta <ajay...@gmail.com> wrote:
>>
>> Erick,
>> w_context and context_str are local to this method and are used only for
>> the 2500 documents in each chunk, not the entire 70k. I am clearing the
>> hashmap after processing each 2500-doc chunk, and I also printed the
>> memory consumed by the hashmap, which is roughly constant per chunk.
>> Memory use should therefore be roughly constant for each invocation of
>> update_context, but it grows by a few MB per invocation and after about
>> 70k documents it goes OOM, so something inside update_context (the
>> search/update/add-document operations) is allocating memory that is not
>> released after the method returns.
>>
>> -Ajay
>>
>> Erick Erickson wrote:
>>>
>>> The first place I'd look is how big your strings got. w_context and
>>> context_str come to mind. My first suspicion is that you're building
>>> ever-longer strings and around 70K documents your strings are large
>>> enough to produce OOMs.
>>>
>>> FWIW
>>> Erick
>>>
>>> On Wed, Mar 3, 2010 at 1:09 PM, ajay_gupta <ajay...@gmail.com> wrote:
>>>>
>>>> Mike,
>>>> Actually my documents are very small. We have csv files where each
>>>> record represents a document, and the records are not very large, so I
>>>> don't think document size is the issue.
>>>> I tokenize each record, and for each token I keep 3 neighbouring tokens
>>>> in a Hashtable. After every X documents (X is currently 2500) I update
>>>> the index with the following code:
>>>>
>>>> // Initialization, done only once at startup
>>>> cram = FSDirectory.open(new File("lucenetemp2"));
>>>> context_writer = new IndexWriter(cram, analyzer, true,
>>>>         IndexWriter.MaxFieldLength.LIMITED);
>>>>
>>>> // Called after each 2500 docs
>>>> update_context()
>>>> {
>>>>     context_writer.commit();
>>>>     context_writer.optimize();
>>>>
>>>>     IndexSearcher is = new IndexSearcher(cram);
>>>>     IndexReader ir = is.getIndexReader();
>>>>     Iterator<String> it = context.keySet().iterator();
>>>>
>>>>     while (it.hasNext())
>>>>     {
>>>>         String word = it.next();
>>>>         // All the context of "word" for the current 2500 docs
>>>>         StringBuffer w_context = context.get(word);
>>>>         Term t = new Term("Word", word);
>>>>         TermQuery tq = new TermQuery(t);
>>>>         TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
>>>>         is.search(tq, collector);
>>>>         ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>>>
>>>>         if (hits.length != 0)
>>>>         {
>>>>             int id = hits[0].doc;
>>>>             TermFreqVector tfv = ir.getTermFreqVector(id, "Context");
>>>>
>>>>             // Builds the context string from the TermFreqVector.
>>>>             // E.g. if the TermFreqVector is word1(2), word2(1), word3(2)
>>>>             // then its output is context_str = "word1 word1 word2 word3 word3"
>>>>             String context_str = getContextString(tfv);
>>>>
>>>>             w_context.append(context_str);
>>>>             Document new_doc = new Document();
>>>>             new_doc.add(new Field("Word", word, Field.Store.YES,
>>>>                     Field.Index.NOT_ANALYZED));
>>>>             new_doc.add(new Field("Context", w_context.toString(),
>>>>                     Field.Store.YES, Field.Index.ANALYZED,
>>>>                     Field.TermVector.YES));
>>>>             context_writer.updateDocument(t, new_doc);
>>>>         } else {
>>>>             Document new_doc = new Document();
>>>>             new_doc.add(new Field("Word", word, Field.Store.YES,
>>>>                     Field.Index.NOT_ANALYZED));
>>>>             new_doc.add(new Field("Context", w_context.toString(),
>>>>                     Field.Store.YES, Field.Index.ANALYZED,
>>>>                     Field.TermVector.YES));
>>>>             context_writer.addDocument(new_doc);
>>>>         }
>>>>     }
>>>>     ir.close();
>>>>     is.close();
>>>> }
>>>>
>>>> I print memory after each invocation of this method and observed that
>>>> memory increases after each call of update_context; around 65-70k
>>>> documents it goes out of memory, so memory is growing on each
>>>> invocation. I expected each invocation to take a roughly constant
>>>> amount of memory rather than growing cumulatively. After each
>>>> invocation of update_context I also call System.gc() to release
>>>> memory, and I tried setting
>>>> context_writer.setMaxBufferedDocs(),
>>>> context_writer.setMaxMergeDocs() and
>>>> context_writer.setRAMBufferSizeMB()
>>>> to smaller values as well, but nothing worked.
>>>>
>>>> Any hint will be very helpful.
>>>>
>>>> Thanks
>>>> Ajay
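(As a quick check along the lines of the profiling suggested above, a minimal sketch of a diagnostic that could be called after each chunk; it assumes a context map of StringBuffers like the one in the code above, and the method name is illustrative only:)

    // Rough diagnostic: prints the heap currently in use and the total
    // number of characters held by the per-word context buffers.
    static void logMemory(java.util.Map<String, StringBuffer> context) {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long chars = 0;
        for (StringBuffer sb : context.values()) {
            chars += sb.length();
        }
        System.out.println("heap used: " + usedMb + " MB, context chars: " + chars);
    }

If the character count stays flat while the heap keeps climbing between chunks, the growth is more likely in the IndexWriter/IndexSearcher usage than in the strings.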
>>>> Michael McCandless-2 wrote:
>>>> >
>>>> > The worst case RAM usage for Lucene is a single doc with many unique
>>>> > terms. Lucene allocates ~60 bytes per unique term (plus space to hold
>>>> > that term's characters = 2 bytes per char). And, Lucene cannot flush
>>>> > within one document -- it must flush after the doc has been fully
>>>> > indexed.
>>>> >
>>>> > This past thread (also from Paul) delves into some of the details:
>>>> >
>>>> > http://lucene.markmail.org/thread/pbeidtepentm6mdn
>>>> >
>>>> > But it's not clear whether that is the issue affecting Ajay -- I think
>>>> > more details about the docs, or some code fragments, could help shed
>>>> > light.
>>>> >
>>>> > Mike
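(As a rough back-of-the-envelope using those figures: a hypothetical single document with 1,000,000 unique terms averaging 8 characters each would need on the order of

    1,000,000 x (60 + 8 x 2) bytes = ~76 MB

of indexing RAM before that one document could be flushed.)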
For each word in this text I want >>>> to >>>> >> store context for that word and index it so I am reading each document >>>> >> and >>>> >> for each word in that document I am appending fixed number of >>>> >> surrounding >>>> >> words. To do that first I search in existing indices if this word >>>> >> already >>>> >> exist and if it is then I get the content and append the new context >>>> and >>>> >> update the document. In case no context exist I create a document with >>>> >> fields "word" and "context" and add these two fields with values as >>>> word >>>> >> value and context value. >>>> >> >>>> >> I tried this in RAM but after certain no of docs it gave out of memory >>>> >> error >>>> >> so I thought to use FSDirectory method but surprisingly after 70k >>>> >> documents >>>> >> it also gave OOM error. I have enough disk space but still I am >>>> getting >>>> >> this >>>> >> error.I am not sure even for disk based indexing why its giving this >>>> >> error. >>>> >> I thought disk based indexing will be slow but atleast it will be >>>> >> scalable. >>>> >> Could someone suggest what could be the issue ? >>>> >> >>>> >> Thanks >>>> >> Ajay >>>> >> -- >>>> >> View this message in context: >>>> >> >>>> http://old.nabble.com/Lucene-Indexing-out-of-memory-tp27755872p27755872 >>>> . >>>> >> html >>>> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >>>> >> >>>> >> >>>> >> --------------------------------------------------------------------- >>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >> >>>> >> >>>> >> --------------------------------------------------------------------- >>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >> >>>> >> >>>> > >>>> > --------------------------------------------------------------------- >>>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> > For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> > >>>> > >>>> > >>>> >>>> -- >>>> View this message in context: >>>> http://old.nabble.com/Lucene-Indexing-out-of-memory-tp27755872p27771637.html >>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>> >>> >> >> -- >> View this message in context: >> http://old.nabble.com/Lucene-Indexing-out-of-memory-tp27755872p27777206.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org