Hi Ian,

Can you try this patch? Grasping at straws at this point, trying to isolate where the JVM fails us...

I'm CC'ing java-dev. To sum up: sometimes when we merge fields, the fdx file ends up exactly one document too short. In adding numerous asserts around this code in SegmentMerger.java, insanely, somehow the call to indexStream.writeLong fails to actually "happen" when we call FieldsWriter.addDocument, even though from looking at the code I see no way to explain that. As hard as it is to believe, it really is looking like a strange JVM bug at this point...
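
For anyone on java-dev following along, the arithmetic behind "one document too short" can be sketched as below. This is only an illustration: it assumes, as the assert messages quoted in this thread suggest, that the fdx file holds exactly one 8-byte long per document and no header. The class and method names are hypothetical and are not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: checks the fdx size invariant
// discussed in this thread, ASSUMING one 8-byte long per document and no header.
public final class FdxLengthCheck {
    /** Size in bytes of one index entry (a single writeLong). */
    public static final long BYTES_PER_DOC = 8L;

    /** Number of whole documents that an fdx file of this length can index. */
    public static long docsInIndex(long fdxLengthInBytes) {
        return fdxLengthInBytes / BYTES_PER_DOC;
    }

    /** How many index entries are missing relative to the expected doc count. */
    public static long missingDocs(long fdxLengthInBytes, long docCount) {
        return docCount - docsInIndex(fdxLengthInBytes);
    }

    public static void main(String[] args) {
        // Figures copied from one of the assertion failures quoted in this thread.
        long bytesWritten = 347400L;
        long docCount = 43426L;
        System.out.println("expected bytes: " + docCount * BYTES_PER_DOC);
        System.out.println("missing docs:   " + missingDocs(bytesWritten, docCount));
    }
}
```

With the figures from the failures reported so far, the file always comes out exactly 8 bytes (one entry) short of docCount * 8.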

Has anyone else seen any odd behavior on update 4 or 5 of JDK 1.6? The issue does not happen on previous updates of JDK 1.6.
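
If it helps anyone check their own setup, here is a minimal sketch of a guard that flags the two updates reported as affected in this thread (1.6.0_04 and 1.6.0_05). The class and method names are hypothetical, and the version parsing assumes the old "1.6.0_NN" style of the java.version property.

```java
// Hypothetical helper: flags the JDK 1.6 updates reported as affected in this thread.
public final class JvmUpdateCheck {
    /** Parses the update number from "1.6.0_05" or "1.6.0_05-b13"; returns -1 if absent. */
    public static int updateNumber(String javaVersion) {
        int underscore = javaVersion.indexOf('_');
        if (underscore < 0) {
            return -1;
        }
        int end = underscore + 1;
        while (end < javaVersion.length() && Character.isDigit(javaVersion.charAt(end))) {
            end++;
        }
        if (end == underscore + 1) {
            return -1; // no digits after the underscore
        }
        return Integer.parseInt(javaVersion.substring(underscore + 1, end));
    }

    /** True only for JDK 1.6.0 update 4 or 5, the versions reported as failing here. */
    public static boolean isReportedAffected(String javaVersion) {
        if (!javaVersion.startsWith("1.6.0")) {
            return false;
        }
        int update = updateNumber(javaVersion);
        return update == 4 || update == 5;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println(version + " reported affected: " + isReportedAffected(version));
    }
}
```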

Mike



Ian Lea wrote:
I agree that it's spooky.  Took quite a while to convince myself.

No interesting JVM options at all.

$ java -ea -cp whatever classname args
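
Since the added asserts only fire when the JVM is started with -ea, a quick way to confirm that assertions really are on is the usual side-effect idiom, sketched here. The class name is hypothetical; the idiom itself uses only the standard assert statement.

```java
// Hypothetical helper: reports whether assertions are enabled for this class.
public final class AssertionStatus {
    public static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment inside the assert only executes when assertions are on (java -ea).
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
    }
}
```

Running it with and without -ea should print true and false respectively.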

Latest test has just failed, a bit quicker this time.

Exception in thread "Thread-0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: out.bytesWritten = 347400 vs docCount = 43426
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
Caused by: java.lang.AssertionError: out.bytesWritten = 347400 vs docCount = 43426
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:334)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:136)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3257)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2952)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)



--
Ian.

On Wed, Mar 19, 2008 at 2:20 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

Well I'm starting to believe you, but still it's spooky. It's really as if a method doesn't get called, very rarely, when Lucene calls it.

 Are you running with any interesting runtime JVM (-server -Xbatch
 etc) options?

 When you first upgraded to 1.6.0_04 and _05, what version of Lucene
 were you using at the time?  Like, you didn't also upgrade Lucene
 simultaneously?

 Mike



 Ian Lea wrote:
Latest patch applied and test started.


I'm confident that the problem only happens on 1.6.0_04 and _05.

We take extracts from Oracle and load different subsets into different lucene indexes on 3 different servers.  This has been running for years, literally, on different versions of the JVM and of lucene. Obviously with occasional mods, but basically unchanged.

Last Friday I upgraded 2 of the servers to 1.6.0_05 and on one of them (phoebe) fired off jobs to load a new full extract into new indexes. And they failed with these CorruptIndexExceptions. Initially I thought it must be a problem with my new extract, since that was all that had changed, but eventually discovered that an unchanged job, reading extract data in old format, had failed with similar error.  And that was on the other server upgraded to 1.6.0_05.  The equivalent job on the non-upgraded server didn't fail.  So downgraded phoebe to 1.6.0_03, reran the load jobs and they worked.  Since then several other load jobs have also run to completion (none have failed) on 1.6.0_03. On the other hand I've yet to see a job run to completion on 1.6.0_05.

--
Ian.

On Wed, Mar 19, 2008 at 1:31 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

 OK try this patch?  The bug doesn't have much more space to hide!

 Are you quite sure about this not happening on certain versions of the JVM?  Because I'm having a hard time seeing where the bug could be in Lucene.

 Mike




 Ian Lea wrote:
Failed eventually, with


Exception in thread "Thread-0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: out.bytesWritten = 570264 vs docCount = 71284
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
Caused by: java.lang.AssertionError: out.bytesWritten = 570264 vs docCount = 71284
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:340)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3257)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2952)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)


--
Ian.

On Wed, Mar 19, 2008 at 9:25 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

 Ian,

 Good morning!

 OK try this new patch.  It just adds further asserts deeper in Lucene.

 Mike




 Ian Lea wrote:
OK.  Latest Exception:

Exception in thread "Thread-0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: after mergeFields: fdx size mismatch: 67871 docs vs 542960 length in bytes of _4l.fdx
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
Caused by: java.lang.AssertionError: after mergeFields: fdx size mismatch: 67871 docs vs 542960 length in bytes of _4l.fdx
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:339)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3257)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2952)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

and infostream attached.

I don't know where you are in the world, but it's getting late here in the UK so that's me done for tonight.

Thanks again for the help.


--
Ian.


On Tue, Mar 18, 2008 at 9:08 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

 Oh that is the wrong version.  That's "trunk" (upcoming 2.4).

 But you had been seeing the issue on 2.3.1 right?

 Can you do this checkout:

     svn checkout https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/ lucene23

And then apply the patch and then get the issue to happen, with
 asserts & infoStream?  Thanks.

 Still, it is useful that you also get it to happen on trunk.

 You're doing great!

 Mike


 On Mar 18, 2008, at 5:02 PM, Ian Lea wrote:

Patched version attached. Patches are being applied to the tree downloaded earlier today by
$ svn checkout http://svn.apache.org/repos/asf/lucene/java/trunk
as requested by Yonik for the TestStress thingy.

Is that the right version?  Sorry if not - I'm a lucene user, not developer!


--
Ian.


On Tue, Mar 18, 2008 at 8:53 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

 Ian can you attach your version of SegmentMerger.java?  Somehow my lines are off from yours.



 Mike

 Ian Lea wrote:
Mike


Latest patch produces similar exception:

Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: after mergeFields: fdx size mismatch: 65184 docs vs 521464 length in bytes of _c9.fdx
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:320)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:297)
Caused by: java.lang.AssertionError: after mergeFields: fdx size mismatch: 65184 docs vs 521464 length in bytes of _c9.fdx
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:347)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3852)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3504)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:211)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:266)

Latest infostream attached.


--
Ian.


On Tue, Mar 18, 2008 at 6:05 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

 Hi Ian,

 Sheesh that's odd. The SegmentMerger produced an .fdx file that is one document too short.

 Can you run with this patch now, again applied to head of 2.3 branch?  I just added another assert inside the loop that does the field merging.

 I will scrutinize this code...

 Mike




 Ian Lea wrote:
Mike


Patch applied and test re-run and picked up an assertion error this time:

Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: after mergeFields: fdx size mismatch: 72357 docs vs 578848 length in bytes of _3o.fdx
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:320)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:297)
Caused by: java.lang.AssertionError: after mergeFields: fdx size mismatch: 72357 docs vs 578848 length in bytes of _3o.fdx
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:342)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3852)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3504)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:211)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:266)

The infostream output is attached.  Since this email is to you and the list it should make it to you.



Yonik: I haven't been able to make TestStressIndexing2 fail.


--
Ian.


On Tue, Mar 18, 2008 at 4:19 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

 Ian,

 Could you apply the attached patch to the head of the 2.3 branch?

 It only adds more asserts, to try to pinpoint where exactly this corruption starts.

 Then, re-run the test with asserts enabled and infoStream turned on and post back.  Thanks.

 Mike




 Ian Lea wrote:

It's failed on servers running SuSE 10.0 and 8.2 (ancient!)

$ uname -a shows
Linux phoebe 2.6.13-15-smp #1 SMP Tue Sep 13 14:56:15 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux

and

Linux phobos 2.4.20-64GB-SMP #1 SMP Mon Mar 17 17:56:03 UTC 2003 i686 unknown unknown GNU/Linux

The first one has a 2.8GHz Intel CPU, don't know about the second.

I'll try and run the stress test.


--
Ian.



On Tue, Mar 18, 2008 at 2:17 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On Tue, Mar 18, 2008 at 7:38 AM, Ian Lea <[EMAIL PROTECTED]> wrote:
Hi


 When bulk loading into a new index I'm seeing this exception

 Exception in thread "Thread-1" org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _4l: fieldsReader shows 67861 but segmentInfo shows 67862
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
 Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _4l: fieldsReader shows 67861 but segmentInfo shows 67862
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3093)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

 when I use java version 1.6.0_05-b13 or 1.6.0_04-b12 on linux, with lucene 2.3.0 or 2.3.1 or lucene-core-2.3-SNAPSHOT from yesterday.

 With java version 1.6.0_03-b05 things work fine.

 The exception happens a few hundred thousand documents into the load.

 A different program updating a different index with different data on a different server gave a similar error on version 1.6.0_05-b13 and lucene 2.3.0.

 Any ideas? Is this maybe a known issue or am I missing something obvious?

My guess is perhaps a thread safety bug, more likely in Lucene indexing code (less likely in the JVM or specific libc).

 What Linux version are you using?
 What hardware are you running on (specifically, the CPU)?

 If possible, it would be great if you could check out Lucene trunk, crank up the iterations by modifying the TestStressIndexing2 and maybe fiddle with some of the other parameters in TestStressIndexing2.testMultiConfig(), and see if you can get it to fail.


 -Yonik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-[EMAIL PROTECTED]
For additional commands, e-mail: java-user-[EMAIL PROTECTED]









