[
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930130#action_12930130
]
Michael McCandless commented on LUCENE-2729:
--------------------------------------------
Thank you for attaching the IW infoStream output! Sorry it took so
long for me to respond.
Aside: it is sad but there is no master TODO list in open source. It
all comes down to our own email inboxes, todo lists, post-it notes all
over the place, etc., and (in my case anyway) things sometimes fall
past the event horizon.
So please, if I don't respond in a day or two on an active issue, bump
it again (put a comment on the issue)! I'd much rather people
over-nag than under-nag, but unfortunately under-nagging is far, far
more common, and it causes important issues to languish unnecessarily.
OK back to the issue :)
I looked through the infoStream but I don't see a smoking gun. Ie,
the logs indicate that nowhere did Lucene try to delete/overwrite
those zero-length files; I see other files being deleted, so, this is
what I'd expect given that ZoieDeletionPolicy is presumably protecting
the segments_3t commit point (to back up its files).
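For context, the way a deletion policy "protects" a commit is roughly
this (a minimal sketch of the idea only, assuming Zoie pins the commit
it is backing up; this is not the actual ZoieDeletionPolicy code, and
all names here are made up):
{code}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Sketch: keep the newest commit plus one commit pinned by an in-progress backup.
public class BackupPinningDeletionPolicy implements IndexDeletionPolicy {

  private IndexCommit lastCommit;  // most recent commit seen
  private IndexCommit pinned;      // commit currently being backed up, or null

  // Called by the backup thread before it starts copying files.
  public synchronized IndexCommit pinLatest() {
    pinned = lastCommit;
    return pinned;
  }

  // Called by the backup thread once the copy is done.
  public synchronized void release() {
    pinned = null;
  }

  public synchronized void onInit(List<? extends IndexCommit> commits) throws IOException {
    onCommit(commits);
  }

  public synchronized void onCommit(List<? extends IndexCommit> commits) throws IOException {
    lastCommit = commits.get(commits.size() - 1);
    // Delete every commit except the newest one and the pinned one, so the
    // pinned commit's files (e.g. segments_3t and the segments it references)
    // stay on disk until release() is called.
    for (IndexCommit commit : commits) {
      boolean isNewest = commit == lastCommit;
      boolean isPinned = pinned != null
          && commit.getSegmentsFileName().equals(pinned.getSegmentsFileName());
      if (!isNewest && !isPinned) {
        commit.delete();
      }
    }
  }
}
{code}
As long as something like that is in place, Lucene itself should never
delete or truncate the files referenced by the protected commit.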
I do see some spooky "other" exceptions, though... these are the first
2 exceptions I see in the log:
{noformat}
14:27:41,290 [bigIndexBuilder_QueueProcessor_3] WARN com.ds.acm.logic.impl.AssetManagerImpl -
Ignoring AssetNotFoundException trying to make sure all metadata from index is loaded before updating an existing asset
Exception in thread "pool-5-thread-6" java.lang.NullPointerException
    at org.apache.coyote.http11.InternalNioOutputBuffer.writeToSocket(InternalNioOutputBuffer.java:430)
    at org.apache.coyote.http11.InternalNioOutputBuffer.flushBuffer(InternalNioOutputBuffer.java:784)
    at org.apache.coyote.http11.InternalNioOutputBuffer.flush(InternalNioOutputBuffer.java:300)
    at org.apache.coyote.http11.Http11NioProcessor.action(Http11NioProcessor.java:1060)
    at org.apache.coyote.Response.action(Response.java:183)
    at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:314)
    at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288)
    at org.apache.catalina.connector.Response.flushBuffer(Response.java:548)
    at org.apache.catalina.connector.ResponseFacade.flushBuffer(ResponseFacade.java:279)
    at org.granite.gravity.AbstractChannel.runReceived(AbstractChannel.java:251)
    at org.granite.gravity.AbstractChannel.runReceive(AbstractChannel.java:199)
    at org.granite.gravity.AsyncReceiver.doRun(AsyncReceiver.java:34)
    at org.granite.gravity.AsyncChannelRunner.run(AsyncChannelRunner.java:52)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
{noformat}
and
{noformat}
14:40:18,382 [Low Memory Detector] WARN com.ds.acm.engine.search.zoieimpl.core.ZoieSystemManager -
Ignoring timeout while attempting to flush zoie memory index to disk to free memory
proj.zoie.api.ZoieException: sync timed out
    at proj.zoie.impl.indexing.AsyncDataConsumer.syncWthVersion(AsyncDataConsumer.java:177)
    at proj.zoie.impl.indexing.AsyncDataConsumer.flushEvents(AsyncDataConsumer.java:155)
    at proj.zoie.impl.indexing.ZoieSystem.flushEvents(ZoieSystem.java:308)
    at com.ds.acm.engine.search.zoieimpl.core.ZoieSystemManager.onLowMemory(ZoieSystemManager.java:220)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.ds.util.event.BasicEventBroadcaster$Handler.invokeMethod(BasicEventBroadcaster.java:197)
    at com.ds.util.event.BasicEventBroadcaster$Handler.handle(BasicEventBroadcaster.java:190)
    at com.ds.util.event.BasicEventBroadcaster.fire(BasicEventBroadcaster.java:108)
    at com.ds.util.cache.LowMemoryWarningBroadcaster$1.handleNotification(LowMemoryWarningBroadcaster.java:135)
    at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138)
    at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171)
    at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272)
    at sun.management.Sensor.trigger(Sensor.java:120)
{noformat}
That 2nd exception happens a total of 9 times... and is rather
spooky. What does it mean? Ie, why is Zoie timing out on flushing
the index to disk, and what does it then do with its RAMDir?
I also see a lot of these:
{noformat}
15:50:18,856 [bigIndexBuilder_QueueProcessor_10] WARN com.ds.acm.logic.impl.AssetManagerImpl -
Ignoring AssetNotFoundException trying to make sure all metadata from index is loaded before updating an existing asset
{noformat}
Would it be possible to instrument the Zoie code to log each file as
the backup process copies it from the snapshot, and at that point
print a listing of the directory?
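Something along these lines would be enough (just a sketch; the
helper, the logger, and where exactly it hooks into the
exportSnapshot/copy loop are all assumptions on my part):
{code}
import java.io.File;
import java.util.Arrays;

// Hypothetical helper to call from Zoie's snapshot/export loop, right before
// each file is copied: logs the file being copied plus a full listing of the
// index directory with sizes and timestamps.
public class SnapshotCopyLogger {

  public static void logDirectoryState(File indexDir, String fileBeingCopied) {
    StringBuilder sb = new StringBuilder();
    sb.append("copying ").append(fileBeingCopied).append("; index dir now contains:\n");
    File[] files = indexDir.listFiles();
    if (files != null) {
      Arrays.sort(files);
      for (File f : files) {
        sb.append("  ").append(f.getName())
          .append(" length=").append(f.length())
          .append(" modified=").append(f.lastModified())
          .append('\n');
      }
    }
    System.out.println(sb);  // or route through the same logger Zoie already uses
  }
}
{code}
With that output we could see exactly when each file went to zero
length relative to the backup's progress.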
Also, can you write to the log when Zoie applies deletes? (Looks like
it happens in proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs).
It's on applying deletes that the corruption is first detected, so, if
we log this event we can better bracket the period of time when the
corruption happened.
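Even a single line at the top of that method would be plenty;
something like this (hypothetical placement, and I'm assuming a log4j
logger is available there):
{code}
import org.apache.log4j.Logger;

// Hypothetical one-liner for the top of BaseSearchIndex.deleteDocs, so we can
// correlate "deletes applied" with the infoStream output and the file timestamps.
public class DeleteLogging {
  private static final Logger log = Logger.getLogger(DeleteLogging.class);

  static void logApplyingDeletes() {
    log.info("applying deletes, ts=" + System.currentTimeMillis());
  }
}
{code}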
Does Zoie ever open an IndexReader or IndexWriter passing in an
existing commit point? Or does it always open the latest commit?
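For reference, here is the difference I mean, in 3.0 API terms (just a
sketch; the directory path is a placeholder):
{code}
import java.io.File;
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

public class CommitPointExample {
  public static void main(String[] args) throws IOException {
    Directory dir = new SimpleFSDirectory(new File("/path/to/index"));

    // Usual case: open whatever the newest commit (latest segments_N) is.
    IndexReader latest = IndexReader.open(dir, true);
    latest.close();

    // Expert case: list the commits still on disk and open a specific one.
    Collection<IndexCommit> commits = IndexReader.listCommits(dir);
    IndexCommit someCommit = commits.iterator().next();
    IndexReader atCommit = IndexReader.open(someCommit, true);
    atCommit.close();

    dir.close();
  }
}
{code}
If Zoie ever goes down the second (commit-point) path, or hands an
IndexCommit to IndexWriter, that would be good to know, since it
changes which segments_N file gets read.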
The timestamps on the zero-length files are particularly spooky -- the
earliest ones are 15:21 (when the first EOF is hit), but then also
15:47 and 15:49 on the others. It seems like on 3 separate occasions
something truncated the files.
> Index corruption after 'read past EOF' under heavy update load and snapshot export
> ----------------------------------------------------------------------------------
>
> Key: LUCENE-2729
> URL: https://issues.apache.org/jira/browse/LUCENE-2729
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 3.0.1, 3.0.2
> Environment: Happens on both OS X 10.6 and Windows 2008 Server.
> Integrated with zoie (using a zoie snapshot from 2010-08-06:
> zoie-2.0.0-snapshot-20100806.jar).
> Reporter: Nico Krijnen
> Attachments: 2010-11-02 IndexWriter infoStream log.zip,
> LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled
> backups of the index. This works fine for small indexes and when there are
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index
> is being changed a lot (lots of document additions and/or deletions), we
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both.
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
>     at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
>     at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
>     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
>     at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166)
>     at org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
>     at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
>     at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
>     at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader -
> Problem copying segments: Lock obtain timed out: org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
>     at org.apache.lucene.store.Lock.obtain(Lock.java:84)
>     at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:957)
>     at proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
>     at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> {code}
> We get exactly the same behaviour on both OS X and on Windows. On both, zoie
> is using a SimpleFSDirectory.
> We also use a SingleInstanceLockFactory (since our process is the only one
> working with the index), but we get the same behaviour with a NativeFSLock.
> The snapshot backup is being made by calling:
> *proj.zoie.impl.indexing.ZoieSystem.exportSnapshot(WritableByteChannel)*
> Same issue in zoie JIRA:
> http://snaprojects.jira.com/browse/ZOIE-51