Re: Lucene 4.4.0 mergeSegments OutOfMemoryError

2013-10-08 Thread Michael van Rooyen
With forceMerge(1) throwing an OOM error, we switched to 
forceMergeDeletes() which worked for a while, but that is now also 
running out of memory.  As a result, I've turned all manner of forced 
merges off.


I'm more than a little apprehensive that if the OOM error can happen as 
part of a forced merge, then it may also be able to happen as part of 
normal merges as the index grows.  I'd be grateful if someone who's 
grokked the code for segment merges could shed some light on whether I'm 
worrying unnecessarily...


Thanks,
Michael.

On 2013/09/26 01:43 PM, Michael van Rooyen wrote:
Thanks for the suggestion Ian.  I switched the optimization to do 
forceMergeDeletes() instead of forceMerge(1) and it completed 
successfully, so we will use that instead.  At least then we're 
guaranteed to have no more than 10% of dead space in the index.
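The ~10% figure corresponds to the merge policy's deletes threshold: TieredMergePolicy's forceMergeDeletesPctAllowed defaults to 10, and forceMergeDeletes() only rewrites segments whose deletion percentage exceeds it. A minimal plain-Java model of that rule (segment sizes are hypothetical; this is not Lucene code):

```java
// Sketch of the "at most ~10% dead space" guarantee behind forceMergeDeletes():
// only segments above the allowed deletion percentage get rewritten.
import java.util.ArrayList;
import java.util.List;

public class ForceMergeDeletesModel {
    // Mirrors TieredMergePolicy's default forceMergeDeletesPctAllowed.
    static final double PCT_ALLOWED = 10.0;

    /** Return the segments (docCount, deletedCount) a forceMergeDeletes-style pass rewrites. */
    static List<int[]> segmentsToRewrite(List<int[]> segments) {
        List<int[]> out = new ArrayList<>();
        for (int[] seg : segments) {
            double pctDeleted = 100.0 * seg[1] / seg[0];
            if (pctDeleted > PCT_ALLOWED) out.add(seg);  // only "dirty" segments merge
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> index = List.of(
            new int[]{1_000_000, 250_000},   // 25% deleted -> rewritten
            new int[]{1_000_000, 50_000});   //  5% deleted -> left alone
        System.out.println(segmentsToRewrite(index).size() + " segment(s) rewritten");
    }
}
```

Segments at or below the threshold are left alone, so up to the threshold's worth of dead space can remain per segment, hence the guarantee is "no more than", not "exactly zero".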


I love the videos on Mike's post - I've always thought that the Lucene 
segment/merge mechanism is such an elegant and efficient way of 
handling a dynamic index.


Michael.

On 2013/09/26 12:45 PM, Ian Lea wrote:

There's a blog posting from Mike McCandless about merging at
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.
Not very recent but probably still relevant.

You could try IndexWriter.forceMergeDeletes() rather than
forceMerge(1).  Still costly but probably less so, and might complete!


--
Ian.


On Thu, Sep 26, 2013 at 11:25 AM, Michael van Rooyen mich...@loot.co.za wrote:

Yes, it happens as part of the early morning optimize, and yes, it's a forceMerge(1) which I've disabled for now.

I haven't looked at the persistence mechanism for Lucene since 2.x, but if I remember correctly, the deleted documents would stay in an index segment until that segment was eventually merged.  Without forcing a merge (optimize in old versions), the footprint on disk could be a multiple of the actual space required for the live documents, and this would have an impact on performance (the deleted documents would clutter the buffer cache).

Is this still the case?  I would have thought it good practice to force the dead space out of an index periodically, but if the underlying storage mechanism has changed and the current index files are more efficient at housekeeping, this may no longer be necessary.

If someone could shed a little light on best practice for indexes where documents are frequently updated (i.e. deleted and re-added), that would be great.

Michael.


On 2013/09/26 11:43 AM, Ian Lea wrote:

Is this OOM happening as part of your early morning optimize or at
some other point?  By optimize do you mean IndexWriter.forceMerge(1)?
You really shouldn't have to use that. If the index grows forever
without it then something else is going on which you might wish to
report separately.


--
Ian.


On Wed, Sep 25, 2013 at 12:35 PM, Michael van Rooyen mich...@loot.co.za wrote:

We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes an OOM error.

As background, our index contains about 14 million documents (growing slowly) and we process about 1 million updates per day. It's about 8GB on disk.  I'm not sure if the Lucene segments merge the way they used to in the early versions, but we've always optimized at 3am to get rid of dead space in the index, or otherwise it grows forever.

The mergeSegments was working under 4.3.1 but the index has grown somewhat on disk since then, probably due to a couple of added NumericDocValues fields.  The java process is assigned about 3GB (the maximum, as it's running on a 32 bit i686 Linux box), and it still goes OOM.

Any advice as to the possible cause and how to circumvent it would be great.
Here's the stack trace:

org.apache.lucene.index.MergePolicy$MergeException:
java.lang.OutOfMemoryError: Java heap space
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.OutOfMemoryError: Java heap space
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:212)
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:174)
org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:301)
org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:253)
org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:215)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)


Thanks,
Michael

Re: Lucene 4.4.0 mergeSegments OutOfMemoryError

2013-09-26 Thread Michael van Rooyen
Yes, it happens as part of the early morning optimize, and yes, it's a 
forceMerge(1) which I've disabled for now.


I haven't looked at the persistence mechanism for Lucene since 2.x, but 
if I remember correctly, the deleted documents would stay in an index 
segment until that segment was eventually merged.  Without forcing a 
merge (optimize in old versions), the footprint on disk could be a 
multiple of the actual space required for the live documents, and this 
would have an impact on performance (the deleted documents would clutter 
the buffer cache).


Is this still the case?  I would have thought it good practice to force 
the dead space out of an index periodically, but if the underlying 
storage mechanism has changed and the current index files are more 
efficient at housekeeping, this may no longer be necessary.


If someone could shed a little light on best practice for indexes where 
documents are frequently updated (i.e. deleted and re-added), that would 
be great.


Michael.


On 2013/09/26 11:43 AM, Ian Lea wrote:

Is this OOM happening as part of your early morning optimize or at
some other point?  By optimize do you mean IndexWriter.forceMerge(1)?
You really shouldn't have to use that. If the index grows forever
without it then something else is going on which you might wish to
report separately.


--
Ian.


On Wed, Sep 25, 2013 at 12:35 PM, Michael van Rooyen mich...@loot.co.za wrote:

We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes an OOM
error.

As background, our index contains about 14 million documents (growing
slowly) and we process about 1 million updates per day. It's about 8GB on
disk.  I'm not sure if the Lucene segments merge the way they used to in the
early versions, but we've always optimized at 3am to get rid of dead space
in the index, or otherwise it grows forever.

The mergeSegments was working under 4.3.1 but the index has grown somewhat
on disk since then, probably due to a couple of added NumericDocValues
fields.  The java process is assigned about 3GB (the maximum, as it's
running on a 32 bit i686 Linux box), and it still goes OOM.

Any advice as to the possible cause and how to circumvent it would be great.
Here's the stack trace:

org.apache.lucene.index.MergePolicy$MergeException:
java.lang.OutOfMemoryError: Java heap space
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.OutOfMemoryError: Java heap space
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:212)
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:174)
org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:301)
org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:253)
org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:215)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)


Thanks,
Michael.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.4.0 mergeSegments OutOfMemoryError

2013-09-26 Thread Michael van Rooyen
Thanks for clarifying, Uwe.  I will keep the daily optimization turned
off.  I may be wrong, but I would guess that if the OOM is happening as
part of the forceMerge, then there may be a chance that it could also
happen as a natural part of index growth when big segments are
merged.  If so, it might be worth looking into anyway.  I suspect that it
may have to do with the way NumericDocValues fields are handled in
the merge process, but again, this is just a stab in the dark...


Michael.
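The stack trace does point at Lucene42DocValuesProducer.loadNumeric, which materializes a whole field's values on the heap. A rough, assumption-laden estimate of what a merge of a 14M-doc index might load is easy to compute; the bit widths below are illustrative guesses, not measurements (packed doc values vary in width):

```java
// Back-of-envelope heap estimate for merging: Lucene 4.2-era numeric doc
// values / norms are loaded per field. The bits-per-value figures are
// assumptions for illustration only, not what this index actually uses.
public class MergeHeapEstimate {
    static long bytesForField(long docCount, int bitsPerValue) {
        return docCount * bitsPerValue / 8;
    }

    public static void main(String[] args) {
        long docs = 14_000_000L;
        // e.g. one norms field at 8 bits plus two NumericDocValues at 64 bits
        long total = bytesForField(docs, 8) + 2 * bytesForField(docs, 64);
        System.out.println(total / (1024 * 1024) + " MB just for these fields");
    }
}
```

Even a few fields like this fit in a 3GB heap on their own, so the OOM presumably needs several segments' worth of such arrays (plus merge buffers) live at once.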

On 2013/09/26 12:38 PM, Uwe Schindler wrote:

Hi,

TieredMergePolicy, which is the default since around Lucene 3.2, prefers
merging segments with many deletions, so forceMerge(1) is not needed.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
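The preference described above can be sketched outside Lucene: rank merge candidates by deletion ratio so that deletion-heavy segments are reclaimed first. This plain-Java ranking is a deliberate simplification of TieredMergePolicy's real cost model (which also weighs segment sizes), and the segment names and numbers are made up:

```java
// Simplified sketch of deletes-aware merge selection: segments with a higher
// proportion of deleted docs are preferred, so dead space is reclaimed by
// normal merging without forceMerge(1). Not Lucene code.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MergeCandidateRanking {
    static class Segment {
        final String name; final int docCount; final int delCount;
        Segment(String name, int docCount, int delCount) {
            this.name = name; this.docCount = docCount; this.delCount = delCount;
        }
        double delRatio() { return (double) delCount / docCount; }
    }

    /** Order candidate segments so deletion-heavy ones are merged first. */
    static List<String> rankForMerge(List<Segment> segs) {
        List<Segment> sorted = new ArrayList<>(segs);
        sorted.sort(Comparator.comparingDouble(Segment::delRatio).reversed());
        List<String> names = new ArrayList<>();
        for (Segment s : sorted) names.add(s.name);
        return names;
    }

    public static void main(String[] args) {
        List<Segment> segs = List.of(
            new Segment("_a", 1000, 10),     //  1% deleted
            new Segment("_b", 1000, 400),    // 40% deleted: best reclaim
            new Segment("_c", 1000, 100));   // 10% deleted
        System.out.println(rankForMerge(segs));  // deletion-heavy segments first
    }
}
```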



-Original Message-
From: Michael van Rooyen [mailto:mich...@loot.co.za]
Sent: Thursday, September 26, 2013 12:26 PM
To: java-user@lucene.apache.org
Cc: Ian Lea
Subject: Re: Lucene 4.4.0 mergeSegments OutOfMemoryError

Yes, it happens as part of the early morning optimize, and yes, it's a
forceMerge(1) which I've disabled for now.

I haven't looked at the persistence mechanism for Lucene since 2.x, but if I
remember correctly, the deleted documents would stay in an index segment
until that segment was eventually merged.  Without forcing a merge
(optimize in old versions), the footprint on disk could be a multiple of the
actual space required for the live documents, and this would have an impact
on performance (the deleted documents would clutter the buffer cache).

Is this still the case?  I would have thought it good practice to force the dead
space out of an index periodically, but if the underlying storage mechanism
has changed and the current index files are more efficient at housekeeping,
this may no longer be necessary.

If someone could shed a little light on best practice for indexes where
documents are frequently updated (i.e. deleted and re-added), that would
be great.

Michael.


On 2013/09/26 11:43 AM, Ian Lea wrote:

Is this OOM happening as part of your early morning optimize or at
some other point?  By optimize do you mean IndexWriter.forceMerge(1)?
You really shouldn't have to use that. If the index grows forever
without it then something else is going on which you might wish to
report separately.


--
Ian.


On Wed, Sep 25, 2013 at 12:35 PM, Michael van Rooyen mich...@loot.co.za wrote:

We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes an OOM error.

As background, our index contains about 14 million documents (growing slowly) and we process about 1 million updates per day. It's about 8GB on disk.  I'm not sure if the Lucene segments merge the way they used to in the early versions, but we've always optimized at 3am to get rid of dead space in the index, or otherwise it grows forever.

The mergeSegments was working under 4.3.1 but the index has grown somewhat on disk since then, probably due to a couple of added NumericDocValues fields.  The java process is assigned about 3GB (the maximum, as it's running on a 32 bit i686 Linux box), and it still goes OOM.

Any advice as to the possible cause and how to circumvent it would be great.
Here's the stack trace:

org.apache.lucene.index.MergePolicy$MergeException:
java.lang.OutOfMemoryError: Java heap space
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.OutOfMemoryError: Java heap space
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:212)
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:174)
org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:301)
org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:253)
org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:215)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

Thanks,
Michael.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Lucene 4.4.0 mergeSegments OutOfMemoryError

2013-09-25 Thread Michael van Rooyen
We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes an 
OOM error.


As background, our index contains about 14 million documents (growing 
slowly) and we process about 1 million updates per day. It's about 8GB 
on disk.  I'm not sure if the Lucene segments merge the way they used to 
in the early versions, but we've always optimized at 3am to get rid of 
dead space in the index, or otherwise it grows forever.


The mergeSegments was working under 4.3.1 but the index has grown 
somewhat on disk since then, probably due to a couple of added 
NumericDocValues fields.  The java process is assigned about 3GB (the 
maximum, as it's running on a 32 bit i686 Linux box), and it still goes OOM.


Any advice as to the possible cause and how to circumvent it would be 
great.  Here's the stack trace:


org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.OutOfMemoryError: Java heap space
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:212)
org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:174)
org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:301)
org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:253)
org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:215)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)


Thanks,
Michael.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Document boosting and native ordering of results

2013-08-28 Thread Michael van Rooyen
Thanks Uwe!  I hadn't investigated DocValues fields, but they look like 
an exciting addition to Lucene and definitely what we need. The 
FunctionQuery / CustomScoreQuery would be a great solution, but there 
doesn't seem to be a ValueSource dedicated to DocValues fields and all 
the field-based value-sources I could find are based on access via the 
field cache.  One of the purposes of the DocValues fields (in my 
understanding) is to bypass the need for using the field cache.  Am I 
missing something?


On 2013/08/26 07:37 PM, Uwe Schindler wrote:

Hi,

This is still possible (in reality it was broken in Lucene versions prior to 4.0 if
you refer to Document.setBoost() - see changelog/MIGRATE.txt): you have to add
an additional DocValues field (a long or double numeric) and use a FunctionQuery /
CustomScoreQuery to modify the score based on this value.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
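In miniature, the suggested DocValues + FunctionQuery / CustomScoreQuery combination amounts to multiplying each document's text-match score by a stored per-document numeric value. The sketch below models that with plain arrays standing in for the index; every name here is illustrative, not the Lucene API:

```java
// Sketch of rank-aware scoring: finalScore = matchScore * perDocRank.
// In Lucene 4.x this is what a CustomScoreQuery wrapping a FunctionQuery
// over a numeric doc-values field achieves; this toy version uses arrays.
public class RankBoostedScoring {
    /** Combine a base relevance score with a stored per-document rank. */
    static float customScore(float baseScore, double rank) {
        return (float) (baseScore * rank);
    }

    public static void main(String[] args) {
        float[] baseScores = {2.0f, 2.0f};    // identical text relevance
        double[] rankDocValues = {1.5, 0.5};  // per-document "rank" column
        for (int doc = 0; doc < baseScores.length; doc++) {
            System.out.println("doc " + doc + " -> "
                + customScore(baseScores[doc], rankDocValues[doc]));
        }
    }
}
```

Two documents with identical relevance come out ordered by rank, which is the behaviour the old document-level boost provided.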



-Original Message-
From: Michael van Rooyen [mailto:mich...@loot.co.za]
Sent: Monday, August 26, 2013 6:39 PM
To: java-user@lucene.apache.org
Subject: Re: Document boosting and native ordering of results

Not sure if there are any thoughts on this.

It definitely makes sense to assign a rank to each document in the index, so
that all else being equal, documents are returned in order of rank.  This is
exactly what the page rank is in Google's index, and Google would be lost
without it.  This used to be possible in old versions of Lucene, but no longer.
Should this be posted as a feature request to the developers?

Thanks,
Michael.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Document boosting and native ordering of results

2013-08-26 Thread Michael van Rooyen

Not sure if there are any thoughts on this.

It definitely makes sense to assign a rank to each document in the 
index, so that all else being equal, documents are returned in order of 
rank.  This is exactly what the page rank is in Google's index, and 
Google would be lost without it.  This used to be possible in old 
versions of Lucene, but no longer.  Should this be posted as a feature 
request to the developers?


Thanks,
Michael.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Altering field info without building index from scratch

2013-08-26 Thread Michael van Rooyen

Hello.

We got the error:

java.lang.IllegalStateException: field xxx was indexed without position data; cannot run PhraseQuery


What I suspect is happening is that field xxx was first indexed as a 
StringField (untokenized), and subsequently changed to TextField 
(tokenized and analyzed).  Even though all the docs containing the field 
have been updated in the index, Lucene still sees this as a raw field.


Is there a way to change the meta data associated with a field without 
building the index from scratch?


Thanks,
Michael.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Document boosting and native ordering of results

2013-08-20 Thread Michael van Rooyen
Hello.  We've just upgraded to 4.3.1 from 2.9.2 and are having a problem 
with native ordering of search results.


We always want documents returned in order of rank, which for us is a 
float value that we assign to each document at index time.  Rank depends 
on whether, for example, the item is in stock and how recent it is.  We 
also store the rank as a field in the index.  We don't use Lucene's 
scoring system for ordering results at all.


In 2.9.2, we used to set the boost on the document (we encoded our rank 
to ensure a nice distribution over the float range that is ultimately 
encoded as a 1-byte norm), and all results were returned in rank order 
without using a sort.
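That 1-byte norm is exactly why the old trick needed careful encoding: Lucene's SmallFloat packs a float into 8 bits, so most of the precision is thrown away. The sketch below uses a simpler linear quantizer to illustrate the trade-off; it is not Lucene's actual SmallFloat encoding (which uses a 3-bit mantissa), and MAX_RANK is an assumed bound:

```java
// Illustrative lossy 1-byte encoding of a float rank, in the spirit of the
// old boost-into-norm trick. Linear quantization for clarity; Lucene's real
// SmallFloat uses a tiny floating-point format instead.
public class OneByteRank {
    static final float MAX_RANK = 10f;  // assumed upper bound on rank

    static byte encode(float rank) {    // map 0..MAX_RANK onto 0..255
        int q = Math.round(rank / MAX_RANK * 255f);
        return (byte) Math.min(255, Math.max(0, q));
    }

    static float decode(byte b) {       // recover an approximation of the rank
        return (b & 0xFF) / 255f * MAX_RANK;
    }

    public static void main(String[] args) {
        float rank = 7.3f;
        byte stored = encode(rank);     // one byte per document in the norms file
        System.out.println(rank + " round-trips to " + decode(stored));
    }
}
```

Ranks closer together than the quantization step collapse to the same byte, which is why the encoding had to spread values "nicely" over the representable range.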


In 4.3.1, the document level boost is gone and only fields can be 
boosted.  Some queries, like a MatchAllDocsQuery, don't seem to take 
field level boosts into account at all when ordering results.


Is there an easy way in Lucene 4 to set the natural order for results in 
the absence of an explicit sort?


Thanks!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: If you could have one feature in Lucene...

2010-02-24 Thread Michael van Rooyen

On 2010/02/24 03:42 PM, Grant Ingersoll wrote:

What would it be?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

   
Stop words counting when in a phrase query (this is probably possible, 
but I'm not sure how :)


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



java.io.IOException: read past EOF since migration to 2.9.1

2010-02-17 Thread Michael van Rooyen

Hello all!

We've been using Lucene for a few years and it's worked without a 
murmur.  I recently upgraded from version 2.3.2 to 2.9.1.  We didn't 
need to make any code changes for the upgrade - apart from the 
deprecation warnings, the code compiled cleanly and 2.9.1 worked fine in 
testing.


Since going live a few days ago, however, we've twice had read past EOF 
exceptions.  The first time it happened, I checked the index and an 
error had crept into the deleted docs count on the main segment:


Segments file=segments_cefg numSegments=4 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 4: name=_abtf8 docCount=9710072
    compound=true
    hasProx=true
    numFiles=2
    size (MB)=4,254.56
    has deletions [delFileName=_abtf8_df.del]
    test: open reader.FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: delete count mismatch: info=263213 vs deletedDocs.count()=260032
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:499)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

I checked the logs for our process that updates the index and there 
were no exceptions logged.  I then optimized the index and checked it 
again and it was all okay, so obviously the optimize / merge process is 
happy to work on an index where the deletions file is in error.


Today, we got the second read past EOF exception.  This time I checked 
the index again and no errors were detected.  I think that whatever 
error there was that led to the EOF exception was on a small segment 
file that got merged into a larger one as more updates were made, before 
I had time to check the index.


Does anyone have any ideas as to what could cause this, or what we could 
do to avoid it happening?  The stack trace for the EOF exception is below.


Thanks,
Michael.

Caused by: java.io.IOException: read past EOF
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:245)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
org.apache.lucene.store.IndexInput.readInt(IndexInput.java:70)
org.apache.lucene.store.IndexInput.readLong(IndexInput.java:93)
org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:210)
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
org.apache.lucene.index.IndexReader.document(IndexReader.java:947)


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: java.io.IOException: read past EOF since migration to 2.9.1

2010-02-17 Thread Michael van Rooyen

Toke Eskildsen wrote:

On Wed, 2010-02-17 at 15:18 +0100, Michael van Rooyen wrote:
  

I recently upgraded from version 2.3.2 to 2.9.1. [...]
Since going live a few days ago, however, we've twice had read past EOF 
exceptions.



The first thing to do is check the Java version. If you're using Sun JRE
1.6.0, you might have encountered a nasty bug in the JVM:
http://issues.apache.org/jira/browse/LUCENE-1282
  
We're still using 1.5.0_06, and have been using it for ages.  When doing 
these kinds of updates, I tend to change only one component at a time.  
In this case, all our code and the JVM stayed the same and all that 
changed was Lucene 2.3.2 to 2.9.1; then the EOF errors started occurring...


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: java.io.IOException: read past EOF since migration to 2.9.1

2010-02-17 Thread Michael van Rooyen



Toke Eskildsen wrote:

On Wed, 2010-02-17 at 15:18 +0100, Michael van Rooyen wrote:

I recently upgraded from version 2.3.2 to 2.9.1. [...]
Since going live a few days ago, however, we've twice had read past 
EOF exceptions.


The first thing to do is check the Java version. If you're using Sun JRE
1.6.0, you might have encountered a nasty bug in the JVM:
http://issues.apache.org/jira/browse/LUCENE-1282
We're still using 1.5.0_06, and have been using it for ages.  When 
doing these kinds of updates, I tend to change only one component at a 
time.  In this case, all our code and the JVM stayed the same and all 
that changed was Lucene 2.3.2 to 2.9.1; then the EOF errors started 
occurring...


Just looking at that bug report, the links provided show that those 
errors occurred with Lucene 2.3.x on JRE 1.6.0.  If it were that JRE bug 
causing the EOF problem, I would have expected to get errors while we 
were using 2.3.2, but we used that for over a year on the same JVM 
(1.5.0_06) without a single error.


What's interesting is that whatever error was in the index will 
disappear once the faulty segment is merged in the next cycle, so it's 
quite possible (and perhaps likely in a heavily updated index) that by 
the time the index is checked, the corruption will be gone.  In this 
case, it may be reported as an EOF exception on a good index, and I've 
seen a few of those in the mail archives.  I suspect that if the merge 
process were more pedantic in not merging segments unless they are 
completely without error (including the deleted docs counts), then a lot 
more users might start encountering this problem.
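The point about merges erasing the evidence can be modelled directly: a merge copies only the live documents and writes fresh metadata, so an inconsistent per-segment delete count cannot survive into the merged segment. A toy illustration (not Lucene's actual data structures):

```java
// Model of why merging hides a delete-count mismatch: the merge keeps only
// live docs and discards the (possibly wrong) recorded counter, producing a
// clean segment. Field names and structure are illustrative only.
import java.util.ArrayList;
import java.util.List;

public class MergeHidesCorruption {
    static class Segment {
        final List<String> docs;
        final List<Boolean> deleted;    // per-doc tombstone bits
        final int recordedDelCount;     // metadata that can disagree with the bits
        Segment(List<String> docs, List<Boolean> deleted, int recordedDelCount) {
            this.docs = docs; this.deleted = deleted; this.recordedDelCount = recordedDelCount;
        }
    }

    /** Merge keeps live docs only; inconsistent metadata is simply dropped. */
    static Segment merge(Segment a) {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < a.docs.size(); i++) {
            if (!a.deleted.get(i)) live.add(a.docs.get(i));
        }
        List<Boolean> none = new ArrayList<>();
        for (int i = 0; i < live.size(); i++) none.add(false);
        return new Segment(live, none, 0);  // fresh, consistent metadata
    }

    public static void main(String[] args) {
        // recordedDelCount (2) disagrees with the actual tombstones (1 deletion)
        Segment bad = new Segment(List.of("d0", "d1", "d2"),
                                  List.of(false, true, false), 2);
        Segment merged = merge(bad);
        System.out.println(merged.docs.size() + " live docs, delCount="
                           + merged.recordedDelCount);
    }
}
```

After the merge there is nothing left for a checker to flag, matching the observation that CheckIndex came back clean the second time.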


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index missing documents

2006-02-22 Thread Michael van Rooyen
I'm using Lucene 1.4.3, and maxBufferedDocs only appears to be in the new 
(unreleased?) version of IndexWriter in CVS.  Looking at the code though, 
setMaxBufferedDocs(n) just translates to minMergeDocs = n.  My index was 
constructed using the default minMergeDocs = 10, so somehow this doesn't 
seem to be the culprit that caused all 2 million+ documents to be missing 
from the crashed index.  It seems more likely that none of the index files 
were registered in Lucene's segments file.  Is there perhaps some other 
trigger that causes Lucene to register the indexes in the segments file, 
or is there some way of flushing the segments file every so often to ensure 
that its list is up to date?  Thanks again for your assistance.


Michael.
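One way to picture the segments-file question: segment data files on disk only "exist" to readers once the segments manifest lists them, so a crash between writing the files and rewriting the manifest silently orphans them. A toy model of that invariant (not Lucene's actual file handling; names are made up):

```java
// Model of why ~500MB of files can be invisible: files count only once the
// "segments" manifest references them. A crash after writing data files but
// before committing the manifest leaves them orphaned on disk.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SegmentsManifestModel {
    final Set<String> filesOnDisk = new HashSet<>();
    final List<String> manifest = new ArrayList<>();  // the "segments" file

    void writeSegment(String name, boolean commitManifest) {
        filesOnDisk.add(name);                   // data reaches disk...
        if (commitManifest) manifest.add(name);  // ...but is visible only after this
    }

    int visibleSegments() { return manifest.size(); }

    public static void main(String[] args) {
        SegmentsManifestModel idx = new SegmentsManifestModel();
        idx.writeSegment("_0", true);
        idx.writeSegment("_1", false);  // power failure before the manifest update
        System.out.println(idx.filesOnDisk.size() + " files on disk, "
                           + idx.visibleSegments() + " visible");
    }
}
```

Under this model the disk footprint stays large after the crash while the reader reports almost nothing, which matches the symptoms described.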

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]

To: java-user@lucene.apache.org
Sent: Monday, February 20, 2006 8:39 PM
Subject: Re: Index missing documents


No, using the same IndexWriter is the way to go.  If you want things to be 
written to disk more frequently, lower the maxBufferedDocs setting.  Go 
down to 1, if you want.  You'll use less memory (RAM), Documents will be 
written to disk without getting buffered in RAM, but the indexing process 
will be slower.


Otis




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Index missing documents

2006-02-19 Thread Michael van Rooyen
While building a large index, we had a power outage.  Over 2 million 
documents had been added, each document with up to about 20 fields.  The 
size of the index on disk is ~500MB.  When I started the process up again, I 
noticed that documents that should have been in the index were missing.  In 
retrospect, I think that Lucene was seeing the index as being completely 
empty (it now says there are 385 docs in the index, but all of those have 
been added since the power outage).  The size on disk is still ~500MB.  Does 
anyone have an idea what might cause the documents to disappear, and what 
can be done to get them back?  Rebuilding takes a while at 100ms per 
document, but it's a bit more concerning if such an outage or crash could 
cause documents to mysteriously disappear from the index...



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]