[ https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596949#action_12596949 ]
Michael McCandless commented on LUCENE-1282:
--------------------------------------------

I finally managed to reproduce this JVM bug, except my case happens while merging term vectors (mergeVectors), not stored fields as in all the other reported cases. I'm running JRE 1.6.0_05 on a Debian Linux box.

My case just uses a modified contrib/benchmark to add 2000 Wikipedia docs to a large index. I got to the point (once the index was 19 GB) where every single time I run the benchmark alg it hits an exception. Often the exception looks like this:

{code}
Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
Caused by: java.io.IOException: read past EOF
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
        at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:68)
        at org.apache.lucene.store.IndexInput.readLong(IndexInput.java:91)
        at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:345)
        at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:992)
        at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:441)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:138)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
{code}

I then added the same check that we now have for mergeFields, ie to verify that the size of the index file (_X.tvx) matches the number of documents merged (see the sketch below). Sometimes, however, I see this different exception:

{code}
Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: 9375
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 9375
        at org.apache.lucene.store.BufferedIndexOutput.writeByte(BufferedIndexOutput.java:36)
        at org.apache.lucene.store.IndexOutput.writeVInt(IndexOutput.java:71)
        at org.apache.lucene.index.TermVectorsWriter.addAllDocVectors(TermVectorsWriter.java:76)
        at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:443)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:138)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
{code}

where the particular array index varies all over the place. This is VERY odd, because that array is the buffer in BufferedIndexOutput and is always allocated to 16384 bytes, so 9375 (and every other value I saw) is *not* out of bounds.

JRE 1.5.0_08 always runs fine.
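For reference, here's a rough sketch of the kind of size check mentioned above, applied to the term vectors index file. This is an illustration only, not the committed Lucene code: the helper class is hypothetical, and the assumed header and per-document entry sizes would have to match the actual on-disk tvx format of the index in question.

{code}
import java.io.IOException;
import org.apache.lucene.store.Directory;

// Sketch only -- hypothetical helper, not the committed Lucene check.
// The assumed layout (a small fixed header followed by one fixed-size
// entry per document) must match the real term vectors index format.
final class TermVectorsSizeCheck {

  private static final int HEADER_BYTES = 4;   // assumed format header size
  private static final int BYTES_PER_DOC = 8;  // assumed per-document entry size

  /** Throws if the merged segment's .tvx file does not account for
   *  exactly the number of documents that were merged. */
  static void checkTvxSize(Directory dir, String segment, int mergedDocs)
      throws IOException {
    final long expected = HEADER_BYTES + (long) mergedDocs * BYTES_PER_DOC;
    final long actual = dir.fileLength(segment + ".tvx");
    if (actual != expected) {
      throw new RuntimeException("mergeVectors produced an invalid result: mergedDocs="
          + mergedDocs + " but " + segment + ".tvx is " + actual + " bytes (expected "
          + expected + "); aborting this merge to prevent index corruption");
    }
  }
}
{code}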
Likewise, running JRE 1.6.0_05 with -Xint also runs fine. However, JRE 1.6.0_05 with -Xbatch still hits exceptions.

So then I started testing "trivial" modifications to the Java source code in mergeVectors, and found, insanely, that this simple diff completely stopped the exceptions:

{code}
       } else {
-        termVectorsWriter.addAllDocVectors(reader.getTermFreqVectors(docNum));
+        // NOTE: it's very important to first assign
+        // to vectors then pass it to
+        // termVectorsWriter.addAllDocVectors; see
+        // LUCENE-1282
+        TermFreqVector[] vectors = reader.getTermFreqVectors(docNum);
+        termVectorsWriter.addAllDocVectors(vectors);
{code}

(Ie, just forcing an assignment to a local variable.) It's crazy that such a trivial mod actually makes a difference with this JRE bug (I would have expected it to be optimized away fairly early in compilation), but I'm quite sure that diff resolves at least the exceptions I've been seeing. So I plan to commit this JRE bug workaround to the 2.4 & 2.3 branches.

I still haven't been able to hit the JRE bug when merging stored fields, but I'm making the same corresponding mod to mergeFields anyway.


> Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
> ------------------------------------------------------
>
>                 Key: LUCENE-1282
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1282
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3, 2.3.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: corrupt_merge_out15.txt
>
>
> This is not a Lucene bug. It's an as-yet not fully characterized Sun
> JRE bug, as best I can tell. I'm opening this issue to gather everything we
> know, to work around it in Lucene if possible, and maybe to open an
> issue with Sun if we can reduce it to a compact test case.
>
> It has hit at least 3 users:
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL PROTECTED]
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL PROTECTED]
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL PROTECTED]
>
> At least JRE 1.6.0_04 and 1.6.0_05 are affected, whereas 1.6.0_03 works OK;
> it's unknown whether 1.6.0_06 shows it.
>
> The bug affects bulk merging of stored fields. When it strikes, the
> segment produced by a merge is corrupt because its fdx file (stored
> fields index file) is missing one document. After iterating many
> times with the first user that hit this, adding diagnostics &
> assertions, it seems that a call to fieldsWriter.addDocument sometimes
> either fails to run entirely, or fails to invoke its call to
> indexStream.writeLong. It's as if, when hotspot compiles a method,
> there's some sort of race condition in cutting over to the compiled
> code whereby a single method call fails to be invoked (speculation).
> Unfortunately, this corruption is silent when it occurs and is only later
> detected when a merge tries to merge the bad segment, or an
> IndexReader tries to open it.
> Here's a typical merge exception:
>
> {code}
> Exception in thread "Thread-10" org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
> Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
>         at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
>         at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
>         at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
> {code}
>
> and here's a typical exception hit when opening a searcher:
>
> {code}
> org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _kk: fieldsReader shows 72670 but segmentInfo shows 72671
>         at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
>         at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
>         at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230)
>         at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
>         at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
>         at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
>         at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
>         at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)
> {code}
>
> Sometimes, adding -Xbatch (forces up-front compilation) or -Xint
> (disables compilation) to the java command line works around the
> issue.
>
> Here are some of the OS's we've seen the failure on:
>
> {code}
> SuSE 10.0
> Linux phoebe 2.6.13-15-smp #1 SMP Tue Sep 13 14:56:15 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux
>
> SuSE 8.2
> Linux phobos 2.4.20-64GB-SMP #1 SMP Mon Mar 17 17:56:03 UTC 2003 i686 unknown unknown GNU/Linux
>
> Red Hat Enterprise Linux Server release 5.1 (Tikanga)
> Linux lab8.betech.virginia.edu 2.6.18-53.1.14.el5 #1 SMP Tue Feb 19 07:18:21 EST 2008 i686 i686 i386 GNU/Linux
> {code}
>
> I've already added assertions to Lucene to detect when this bug
> strikes, but since assertions are not usually enabled, I plan to add a
> real check to catch when this bug strikes *before* we commit the merge
> to the index. This way we can detect & quarantine the failure and
> prevent corruption from entering the index.
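As an illustration of what such a pre-commit check could look like (a sketch only, not the actual patch: the helper and its parameters are hypothetical, and it assumes the caller can read back the stored-fields document count of the newly merged segment before committing the merge):

{code}
import org.apache.lucene.index.CorruptIndexException;

// Sketch only -- hypothetical helper, not the committed Lucene check.
final class MergeDocCountCheck {

  /** Compares the document count the stored-fields reader reports for the
   *  newly merged segment against the number of documents that were merged,
   *  so a corrupt segment can be quarantined before the merge is committed. */
  static void checkDocCount(String segmentName, int fieldsDocCount, int mergedDocCount)
      throws CorruptIndexException {
    if (fieldsDocCount != mergedDocCount) {
      // Same mismatch that SegmentReader later reports as
      // "doc counts differ for segment ...", caught here before the
      // bad segment can enter the index.
      throw new CorruptIndexException("doc counts differ for segment " + segmentName
          + ": fieldsReader shows " + fieldsDocCount + " but merge expected "
          + mergedDocCount);
    }
  }
}
{code}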