[
https://issues.apache.org/jira/browse/OAK-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629197#comment-14629197
]
Chetan Mehrotra edited comment on OAK-3110 at 7/16/15 5:05 AM:
---------------------------------------------------------------
Attaching [logs|^copier.log] from IndexCopier and AsyncIndexUpdate for a single
run where the run failed due to exception in CopyOnWrite. The issue here has
its root in OAK-2855
h3. Interaction between CopyOnWrite (COW) and CopyOnRead (COR)
COW logic is implemented with the requirement that it has to *share* the same
local directory as used in COR. This ensures that it can make use of already
copied index files while indexing. Any indexing sequence would work in
following sequence
# COW - Index done - Index moved to R1
# COR - Open at R1
# COW - Index moves to R2
# COR - Picks up index at R2
# COW - Starts indexing for R3
Further COR takes a snapshot of current local directory content and when it
closes it deletes those files which are present in local directory but not
present in remote directory. This ensures that issues seen in OAK-2855 do not
happen. This approach would have worked fine if above sequence is strictly
sequential. However in real world thats not the case as the 2 tasks are
performed via different threads.
For e.g. say at #3 when COW is done with indexing the Lucene index update
observor would detect that index has updated. This observation processing would
happen in a different thread. Now when at #4 COR is opening the index it can
happen that COW has also started with its index at #5 and hence started
creating new files. These new files would be picked by COR at #4 during its
opening phase and it would consider them part of old index. Now when COR would
close it would delete these files and that would then result in a
FileNotFoundException for COW in #5
h3. Mapping to Logs
In the attached logs following timeline is seen
{noformat}
14:20:01.700 *DEBUG* [pool-58-thread-1] AsyncIndexUpdate Running background
index task async
14:20:08.601 *DEBUG* [pool-58-thread-1] IndexCopier [COW] Creating output
_2s7.fdt
14:20:18.146 *DEBUG* [oak-lucene-0] IndexCopier [/oak:index/foxDamIndex]
Following files have been removed from Lucene index directory [_2s7.fdt,
_2s7.fdx]
14:20:35.752 *DEBUG* [oak-lucene-0] IndexCopier Error occurred while copying
files. Further processing would be skipped
java.io.FileNotFoundException: _2s7.fdt
at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
14:20:35.930 *WARN* [pool-58-thread-1] AsyncIndexUpdate The async index update
failed
{noformat}
So as can be seen that COW starts at 20:01 and creates _2s7.fdt at 20:08.
However that file got deleted by oak-lucene-0 (thread which opens index with
COR) at 20:18 and then eventually FileNotFoundException is thrown when copy
logic in COW tries to read the file
h3. Fix
As a fix COW and COR would have to share the current working set and COR should
not touch files in that working set
was (Author: chetanm):
Attaching [logs|^copier.log] from IndexCopier and AsyncIndexUpdate for a single
run where the run failed due to exception in CopyOnWrite. The issue here has
its root in OAK-2855
h3. Interaction between CopyOnWrite (COW) and CopyOnRead (COR)
COW logic is implemented with the requirement that it has sure the same local
directory as used in COR. This ensures that it can make use of alread copied
index files while indexing. Any indexing sequence would work in following
sequence
# COW - Index done - Index moved to R1
# COR - Open at R1
# COW - Index moves to R2
# COR - Picks up index at R2
# COW - Starts indexing for R3
Further COR takes a snapshot of current local directory content and when it
closes it deletes those files which are present in local directory but not
present in remote directory. This ensures that issues seen in OAK-2855 do not
happen. This approach would have worked fine if above sequence is strictly
sequential. However in real world thats not the case as the 2 task are
performed via different threads.
For e.g. say at #3 when COW is done with indexing the Lucene index update
observor would detect that index has updated. This observation processing would
happen in a different thread. Now when at #4 COR is opening the index it can
happen that COW has also started with its index at #5 and hence started
creating new files. These new files would be picked by COR at #4 during its
opening phase and it would consider them part of old index. Now when COR would
close it would delete these files and that would then result in a
FileNotFoundException for COW in #5
h3. Mapping to Logs
In the attached logs following timeline is seen
{noformat}
14:20:01.700 *DEBUG* [pool-58-thread-1] AsyncIndexUpdate Running background
index task async
14:20:08.601 *DEBUG* [pool-58-thread-1] IndexCopier [COW] Creating output
_2s7.fdt
14:20:18.146 *DEBUG* [oak-lucene-0] IndexCopier [/oak:index/foxDamIndex]
Following files have been removed from Lucene index directory [_2s7.fdt,
_2s7.fdx]
14:20:35.752 *DEBUG* [oak-lucene-0] IndexCopier Error occurred while copying
files. Further processing would be skipped
java.io.FileNotFoundException: _2s7.fdt
at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
14:20:35.930 *WARN* [pool-58-thread-1] AsyncIndexUpdate The async index update
failed
{noformat}
So as can be seen that COW starts at 20:01 and creates _2s7.fdt at 20:08.
However that file got deleted by oak-lucene-0 (thread which opens index with
COR) at 20:18 and then eventually FileNotFoundException is thrown when copy
logic in COW tries to read the file
h3. Fix
As a fix COW and COR would have to share the current working set and COR should
not touch files in that working set
> AsyncIndexer fails due to FileNotFoundException thrown by CopyOnWrite logic
> ---------------------------------------------------------------------------
>
> Key: OAK-3110
> URL: https://issues.apache.org/jira/browse/OAK-3110
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: lucene
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Fix For: 1.2.3, 1.3.3, 1.0.18
>
> Attachments: copier.log
>
>
> At times the CopyOnWrite reports following exception
> {noformat}
> 15.07.2015 14:20:35.930 *WARN* [pool-58-thread-1]
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate The async index
> update failed
> org.apache.jackrabbit.oak.api.CommitFailedException: OakLucene0004: Failed to
> close the Lucene index
> at
> org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:204)
> at
> org.apache.jackrabbit.oak.plugins.index.IndexUpdate.leave(IndexUpdate.java:219)
> at
> org.apache.jackrabbit.oak.spi.commit.VisibleEditor.leave(VisibleEditor.java:63)
> at
> org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:56)
> at
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.updateIndex(AsyncIndexUpdate.java:366)
> at
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:311)
> at
> org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
> at org.quartz.core.JobRunShell.run(JobRunShell.java:207)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: _2s7.fdt
> at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
> at
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory$COWLocalFileReference.fileLength(IndexCopier.java:837)
> at
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory.fileLength(IndexCopier.java:607)
> at
> org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:141)
> at
> org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:529)
> at
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:502)
> at
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:508)
> at
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:618)
> at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3147)
> at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3123)
> at
> org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:988)
> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:932)
> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:894)
> at
> org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.closeWriter(LuceneIndexEditorContext.java:192)
> at
> org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:202)
> ... 10 common frames omitted
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)