Re: Lucene 4.4.0 mergeSegments OutOfMemoryError
With forceMerge(1) throwing an OOM error, we switched to forceMergeDeletes(), which worked for a while, but that is now also running out of memory. As a result, I've turned all manner of forced merges off. I'm more than a little apprehensive: if the OOM error can happen as part of a forced merge, it may also happen as part of normal merges as the index grows. I'd be grateful if someone who's grokked the code for segment merges could shed some light on whether I'm worrying unnecessarily...

Thanks, Michael.

On 2013/09/26 01:43 PM, Michael van Rooyen wrote:
Thanks for the suggestion Ian. I switched the optimization to do forceMergeDeletes() instead of forceMerge(1) and it completed successfully, so we will use that instead. At least then we're guaranteed to have no more than 10% of dead space in the index. I love the videos in Mike's post - I've always thought that the Lucene segment/merge mechanism is such an elegant and efficient way of handling a dynamic index. Michael.

On 2013/09/26 12:45 PM, Ian Lea wrote:
There's a blog post from Mike McCandless about merging at http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html. Not very recent but probably still relevant. You could try IndexWriter.forceMergeDeletes() rather than forceMerge(1). Still costly, but probably less so, and it might complete!

--
Ian.

On Thu, Sep 26, 2013 at 11:25 AM, Michael van Rooyen mich...@loot.co.za wrote:
Yes, it happens as part of the early morning optimize, and yes, it's a forceMerge(1) which I've disabled for now.

I haven't looked at the persistence mechanism for Lucene since 2.x, but if I remember correctly, deleted documents stay in an index segment until that segment is eventually merged. Without forcing a merge (optimize, in old versions), the footprint on disk could be a multiple of the space actually required for the live documents, and this would hurt performance (the deleted documents would clutter the buffer cache). Is this still the case? I would have thought it good practice to force the dead space out of an index periodically, but if the underlying storage mechanism has changed and the current index files are more efficient at housekeeping, this may no longer be necessary. If someone could shed a little light on best practice for indexes where documents are frequently updated (i.e. deleted and re-added), that would be great. Michael.

On 2013/09/26 11:43 AM, Ian Lea wrote:
Is this OOM happening as part of your early morning optimize or at some other point? By optimize do you mean IndexWriter.forceMerge(1)? You really shouldn't have to use that. If the index grows forever without it then something else is going on, which you might wish to report separately.

--
Ian.

On Wed, Sep 25, 2013 at 12:35 PM, Michael van Rooyen mich...@loot.co.za wrote:
We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes an OOM error. As background, our index contains about 14 million documents (growing slowly) and we process about 1 million updates per day. It's about 8GB on disk. I'm not sure if Lucene segments merge the way they used to in the early versions, but we've always optimized at 3am to get rid of dead space in the index; otherwise it grows forever. mergeSegments was working under 4.3.1, but the index has grown somewhat on disk since then, probably due to a couple of added NumericDocValues fields. The Java process is assigned about 3GB (the maximum, as it's running on a 32-bit i686 Linux box), and it still goes OOM.
Any advice as to the possible cause and how to circumvent it would be great. Here's the stack trace:

org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:212)
    at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:174)
    at org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:301)
    at org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:253)
    at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:215)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

Thanks, Michael
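In code, the switch described in this thread amounts to something like the following sketch (the long-lived writer and the explicit commit are assumptions on my part, not details from the thread):

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;

    public final class NightlyMaintenance {
        // 'writer' is assumed to be the application's long-lived IndexWriter.
        static void reclaimDeletes(IndexWriter writer) throws IOException {
            // Unlike forceMerge(1), which rewrites the entire index into one
            // segment, this only merges segments whose deletion ratio exceeds
            // the merge policy's threshold (10% by default).
            writer.forceMergeDeletes();
            writer.commit();
        }
    }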
Re: Lucene 4.4.0 mergeSegments OutOfMemoryError
Yes, it happens as part of the early morning optimize, and yes, it's a forceMerge(1) which I've disabled for now.

I haven't looked at the persistence mechanism for Lucene since 2.x, but if I remember correctly, deleted documents stay in an index segment until that segment is eventually merged. Without forcing a merge (optimize, in old versions), the footprint on disk could be a multiple of the space actually required for the live documents, and this would hurt performance (the deleted documents would clutter the buffer cache). Is this still the case? I would have thought it good practice to force the dead space out of an index periodically, but if the underlying storage mechanism has changed and the current index files are more efficient at housekeeping, this may no longer be necessary. If someone could shed a little light on best practice for indexes where documents are frequently updated (i.e. deleted and re-added), that would be great.

Michael.

On 2013/09/26 11:43 AM, Ian Lea wrote:
Is this OOM happening as part of your early morning optimize or at some other point? By optimize do you mean IndexWriter.forceMerge(1)? You really shouldn't have to use that. If the index grows forever without it then something else is going on, which you might wish to report separately.

--
Ian.

On Wed, Sep 25, 2013 at 12:35 PM, Michael van Rooyen mich...@loot.co.za wrote:
[...]
Re: Lucene 4.4.0 mergeSegments OutOfMemoryError
Thanks for clarifying Uwe. I will keep the daily optimization turned off.

I may be wrong, but I would guess that if the OOM is happening as part of the forceMerge, there's a chance it could also happen as a natural part of index growth, when big segments are merged. If so, it might be worth looking into anyway. I suspect it may have to do with the way NumericDocValues fields are handled in the merge process, but again, this is just a stab in the dark...

Michael.

On 2013/09/26 12:38 PM, Uwe Schindler wrote:
Hi,

TieredMergePolicy, which is the default since around Lucene 3.2, prefers merging segments with many deletions, so forceMerge(1) is not needed.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Michael van Rooyen [mailto:mich...@loot.co.za]
Sent: Thursday, September 26, 2013 12:26 PM
To: java-user@lucene.apache.org
Cc: Ian Lea
Subject: Re: Lucene 4.4.0 mergeSegments OutOfMemoryError

Yes, it happens as part of the early morning optimize, and yes, it's a forceMerge(1) which I've disabled for now. [...]
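For completeness, tuning the default merge policy along the lines Uwe describes might look like this sketch against the Lucene 4.x API (the reclaim weight of 10.0 is an illustrative value, not a recommendation from the thread):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.util.Version;

    class MergeConfig {
        static IndexWriterConfig build() {
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44,
                    new StandardAnalyzer(Version.LUCENE_44));
            TieredMergePolicy tmp = new TieredMergePolicy();
            // Default is 2.0; a higher weight biases natural merges toward
            // segments carrying many deletions.
            tmp.setReclaimDeletesWeight(10.0);
            // forceMergeDeletes() may leave up to this percentage of
            // deletions per segment (10.0 is the default).
            tmp.setForceMergeDeletesPctAllowed(10.0);
            iwc.setMergePolicy(tmp);
            return iwc;
        }
    }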
Lucene 4.4.0 mergeSegments OutOfMemoryError
We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes an OOM error.

As background, our index contains about 14 million documents (growing slowly) and we process about 1 million updates per day. It's about 8GB on disk. I'm not sure if Lucene segments merge the way they used to in the early versions, but we've always optimized at 3am to get rid of dead space in the index; otherwise it grows forever. mergeSegments was working under 4.3.1, but the index has grown somewhat on disk since then, probably due to a couple of added NumericDocValues fields. The Java process is assigned about 3GB (the maximum, as it's running on a 32-bit i686 Linux box), and it still goes OOM.

Any advice as to the possible cause and how to circumvent it would be great. Here's the stack trace:

org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:212)
    at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:174)
    at org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:301)
    at org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:253)
    at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:215)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

Thanks, Michael.
Re: Document boosting and native ordering of results
Thanks Uwe! I hadn't investigated DocValues fields, but they look like an exciting addition to Lucene and definitely what we need. The FunctionQuery / CustomScoreQuery approach would be a great solution, but there doesn't seem to be a ValueSource dedicated to DocValues fields, and all the field-based value sources I could find access values via the field cache. One of the purposes of DocValues fields (in my understanding) is to bypass the need for the field cache. Am I missing something?

On 2013/08/26 07:37 PM, Uwe Schindler wrote:
Hi,

This is still possible (in reality it was broken in Lucene versions prior to 4.0, if you refer to Document.setBoost() - see changelog/MIGRATE.txt): you add an additional DocValues field (a long or double numeric) and use a FunctionQuery / CustomScoreQuery to modify the score based on this value.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Michael van Rooyen [mailto:mich...@loot.co.za]
Sent: Monday, August 26, 2013 6:39 PM
To: java-user@lucene.apache.org
Subject: Re: Document boosting and native ordering of results

Not sure if there are any thoughts on this. [...]
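In code, Uwe's suggestion amounts to something like the following sketch (the "rank" field name is illustrative and assumes the field was indexed as a FloatDocValuesField; note that FloatFieldSource goes through the FieldCache API, which in 4.x will use an existing numeric DocValues field directly rather than un-inverting, if I read FieldCacheImpl correctly):

    import org.apache.lucene.queries.CustomScoreQuery;
    import org.apache.lucene.queries.function.FunctionQuery;
    import org.apache.lucene.queries.function.valuesource.FloatFieldSource;
    import org.apache.lucene.search.Query;

    class RankedQuery {
        // Multiply each hit's text score by its per-document "rank" value.
        // CustomScoreQuery's default behaviour is subQueryScore * functionValue.
        static Query withRank(Query textQuery) {
            FunctionQuery rank = new FunctionQuery(new FloatFieldSource("rank"));
            return new CustomScoreQuery(textQuery, rank);
        }
    }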
Re: Document boosting and native ordering of results
Not sure if there are any thoughts on this. It definitely makes sense to assign a rank to each document in the index, so that all else being equal, documents are returned in order of rank. This is exactly what PageRank is in Google's index, and Google would be lost without it. This used to be possible in old versions of Lucene, but no longer. Should this be posted as a feature request to the developers?

Thanks, Michael.
Altering field info without building index from scratch
Hello. We got the error:

java.lang.IllegalStateException: field xxx was indexed without position data; cannot run PhraseQuery

What I suspect is happening is that field xxx was first indexed as a StringField (untokenized), and subsequently changed to a TextField (tokenized and analyzed). Even though all the docs containing the field have been updated in the index, Lucene still sees it as a raw field. Is there a way to change the metadata associated with a field without building the index from scratch?

Thanks, Michael.
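For context, the mismatch described above typically arises like this (a sketch; the field name and value are illustrative). StringField indexes the whole value as a single token with positions omitted, while TextField analyzes the value and records the positions PhraseQuery needs; if I remember the FieldInfos merging correctly, once a segment has seen the field without positions, the combined field info keeps the weaker setting:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    class FieldMismatch {
        static Document originalDoc() {
            Document doc = new Document();
            // Indexed as one token, no position data - PhraseQuery cannot run.
            doc.add(new StringField("xxx", "quick brown fox", Field.Store.NO));
            return doc;
        }

        static Document updatedDoc() {
            Document doc = new Document();
            // Analyzed into terms with positions - what PhraseQuery expects.
            doc.add(new TextField("xxx", "quick brown fox", Field.Store.NO));
            return doc;
        }
    }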
Document boosting and native ordering of results
Hello. We've just upgraded to 4.3.1 from 2.9.2 and are having a problem with the native ordering of search results.

We always want documents returned in order of rank, which for us is a float value that we assign to each document at index time. Rank depends on whether, for example, the item is in stock and how recent it is. We also store the rank as a field in the index. We don't use Lucene's scoring system for ordering results at all. In 2.9.2, we used to set the boost on the document (we encoded our rank to ensure a nice distribution over the float range that is ultimately encoded as a 1-byte norm), and all results were returned in rank order without using a sort.

In 4.3.1, the document-level boost is gone and only fields can be boosted. Some queries, like a MatchAllDocsQuery, don't seem to take field-level boosts into account at all when ordering results. Is there an easy way in Lucene 4 to set the natural order for results in the absence of an explicit sort?

Thanks!
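Short of a document-level boost, the obvious fallback is an explicit sort on the stored rank field (a sketch; the field name and page size are illustrative, and it assumes rank is indexed as a sortable numeric field):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    class RankOrdering {
        static TopDocs topByRank(IndexSearcher searcher) throws IOException {
            // true = descending, so the highest-ranked documents come first.
            Sort byRank = new Sort(new SortField("rank", SortField.Type.FLOAT, true));
            return searcher.search(new MatchAllDocsQuery(), 100, byRank);
        }
    }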
Re: If you could have one feature in Lucene...
On 2010/02/24 03:42 PM, Grant Ingersoll wrote:
What would it be?

Stop words counting when in a phrase query (this is probably possible, but I'm not sure how :)
java.io.IOException: read past EOF since migration to 2.9.1
Hello all! We've been using Lucene for a few years and it's worked without a murmur. I recently upgraded from version 2.3.2 to 2.9.1. We didn't need to make any code changes for the upgrade - apart from the deprecation warnings, the code compiled cleanly, and 2.9.1 worked fine in testing.

Since going live a few days ago, however, we've twice had read past EOF exceptions. The first time it happened, I checked the index, and an error had crept into the deleted docs count on the main segment:

Segments file=segments_cefg numSegments=4 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 4: name=_abtf8 docCount=9710072
    compound=true
    hasProx=true
    numFiles=2
    size (MB)=4,254.56
    has deletions [delFileName=_abtf8_df.del]
    test: open reader.FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
    java.lang.RuntimeException: delete count mismatch: info=263213 vs deletedDocs.count()=260032
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:499)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

I checked the logs for our process that updates the index and there were no exceptions logged. I then optimized the index and checked it again, and it was all okay, so obviously the optimize / merge process is happy to work on an index where the deletions file is in error.

Today, we got the second read past EOF exception. This time I checked the index again and no errors were detected. I think that whatever error led to the EOF exception was on a small segment file that got merged into a larger one as more updates were made, before I had time to check the index.

Does anyone have any ideas as to what could cause this, or what we could do to avoid it happening? The stack trace for the EOF exception is below.

Thanks, Michael.

Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:245)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:70)
    at org.apache.lucene.store.IndexInput.readLong(IndexInput.java:93)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:210)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
    at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:947)
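(The report above comes from the CheckIndex tool, which can be run from the command line; the jar name and index path here are illustrative. The -fix option removes broken segments along with their documents, so back up the index before using it.)

    java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex /path/to/index
    java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex /path/to/index -fix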
Re: java.io.IOException: read past EOF since migration to 2.9.1
Toke Eskildsen wrote:
On Wed, 2010-02-17 at 15:18 +0100, Michael van Rooyen wrote:
I recently upgraded from version 2.3.2 to 2.9.1. [...] Since going live a few days ago, however, we've twice had read past EOF exceptions.

The first thing to do is check the Java version. If you're using Sun JRE 1.6.0, you might have encountered a nasty bug in the JVM: http://issues.apache.org/jira/browse/LUCENE-1282

We're still using 1.5.0_06, and have been using it for ages. When doing these kinds of updates, I tend to change only one component at a time. In this case, all our code and the JVM stayed the same; all that changed was Lucene, 2.3.2 to 2.9.1, and then the EOF errors started occurring...
Re: java.io.IOException: read past EOF since migration to 2.9.1
Toke Eskildsen wrote:
[...]

Just looking at that bug report, the links provided show that those errors occurred with Lucene 2.3.x on JRE 1.6.0. If it were that JRE bug causing the EOF problem, I would have expected to get errors while we were using 2.3.2, but we used that for over a year on the same JVM (1.5.0_06) without a single error.

What's interesting is that whatever error was in the index will disappear once the faulty segment is merged in the next cycle, so it's quite possible (and perhaps likely in a heavily updated index) that by the time the index is checked, the corruption will be gone. In this case, it may be reported as an EOF exception on a good index, and I've seen a few of those in the mail archives. I suspect that if the merge process were more pedantic about not merging segments unless they are completely without error (including the deleted docs counts), a lot more users might start encountering this problem.
Re: Index missing documents
I'm using Lucene 1.4.3, and maxBufferedDocs only appears in the new (unreleased?) version of IndexWriter in CVS. Looking at the code though, setMaxBufferedDocs(n) just translates to minMergeDocs = n. My index was constructed using the default minMergeDocs = 10, so this doesn't seem to be the culprit that caused all 2 million+ documents to go missing from the crashed index. It seems more likely that none of the index files were registered in Lucene's segments file. Is there perhaps some other trigger that causes Lucene to register the indexes in the segments file, or is there some way of flushing the segments file every so often to ensure that its list is up to date?

Thanks again for your assistance.

Michael.

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, February 20, 2006 8:39 PM
Subject: Re: Index missing documents

No, using the same IndexWriter is the way to go. If you want things to be written to disk more frequently, lower the maxBufferedDocs setting. Go down to 1, if you want. You'll use less memory (RAM), documents will be written to disk without getting buffered in RAM, but the indexing process will be slower.

Otis
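A checkpointing pattern like the following sketch would bound the loss (Lucene 1.4.x API; the batch size and the periodic close are my assumptions, not something prescribed in the thread):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    class CheckpointedIndexing {
        static void indexAll(Iterator docs, String path) throws IOException {
            IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true); // true = create
            writer.minMergeDocs = 10; // public field in 1.4.x; buffer few docs in RAM
            int count = 0;
            while (docs.hasNext()) {
                writer.addDocument((Document) docs.next());
                if (++count % 10000 == 0) {
                    // Closing writes the segments file, so a crash loses at
                    // most one batch of documents.
                    writer.close();
                    writer = new IndexWriter(path, new StandardAnalyzer(), false); // false = append
                }
            }
            writer.close();
        }
    }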
Index missing documents
While building a large index, we had a power outage. Over 2 million documents had been added, each document with up to about 20 fields. The size of the index on disk is ~500MB.

When I started the process up again, I noticed that documents that should have been in the index were missing. In retrospect, I think that Lucene was seeing the index as completely empty (it now says there are 385 docs in the index, but all of those have been added since the power outage). The size on disk is still ~500MB.

Does anyone have an idea what might cause the documents to disappear, and what can be done to get them back? Rebuilding takes a while at 100ms per document, but it's more concerning that such an outage or crash could cause documents to mysteriously disappear from the index...