[
https://issues.apache.org/jira/browse/OAK-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
angela updated OAK-4114:
------------------------
Priority: Critical (was: Minor)
> Cached lucene index gets corrupted in case of unclean shutdown and journal
> rollback in SegmentNodeStore
> -------------------------------------------------------------------------------------------------------
>
> Key: OAK-4114
> URL: https://issues.apache.org/jira/browse/OAK-4114
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: lucene
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Priority: Critical
> Labels: resilience
> Fix For: 1.6
>
>
> Currently Oak Lucene support would copy index files to local file system as
> part of CopyOnRead feature. In one of the setup it has been observed that
> index logic was failing with following error
> {noformat}
> 04.02.2016 17:47:52.391 *WARN* [oak-lucene-3]
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier
> [/oak:index/lucene] Found local copy for _2ala.cfs in
> MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>
> lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
> but size of local 9320 differs from remote 3714150. Content would be read
> from remote file only
> 04.02.2016 17:47:52.399 *WARN* [oak-lucene-3]
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier
> [/oak:index/lucene] Found local copy for segments_28je in
> MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>
> lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
> but size of local 1214 differs from remote 1175. Content would be read from
> remote file only
> 04.02.2016 17:47:52.491 *ERROR* [oak-lucene-3]
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker Failed to open
> Lucene index at /oak:index/lucene
> org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual
> header=1953790076 vs expected header=1071082519 (resource:
> SlicedIndexInput(SlicedIndexInput(_2ala.fnm in _2ala.cfs) in _2ala.cfs
> slice=8810:9320))
> at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128)
> at
> org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
> at
> org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:215)
> {noformat}
> Here size of __2ala.cfs_ differed from remote copy and possible other index
> file may have same size but different content. Comparing the modified time of
> the files with those in Oak it can be seen that one of file system was older
> than one in Oak
> {noformat}
> _2alr.cfs={name=_2alr.cfs, size=1152402, sizeStr=1.2 MB, modified=Thu Feb 04
> 17:52:31 GMT 2016, osModified=Feb 4 17:52, osSize=1152402, mismatch=false}
> _2ala.cfe={name=_2ala.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=224, mismatch=false}
> _2ala.si={name=_2ala.si, size=252, sizeStr=252 B, modified=Thu Feb 04
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=252, mismatch=false}
> _2ala.cfs={name=_2ala.cfs, size=3714150, sizeStr=3.7 MB, modified=Thu Feb 04
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=9320, mismatch=true}
> _14u3_29.del={name=_14u3_29.del, size=1244036, sizeStr=1.2 MB, modified=Thu
> Feb 04 16:37:35 GMT 2016, osModified=Feb 4 16:37, osSize=1244036,
> mismatch=false}
> _2akw.si={name=_2akw.si, size=252, sizeStr=252 B, modified=Thu Feb 04
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=252, mismatch=false}
> _2akw.cfe={name=_2akw.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=224, mismatch=false}
> _2akw.cfs={name=_2akw.cfs, size=4952761, sizeStr=5.0 MB, modified=Thu Feb 04
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=4952761, mismatch=false}
> {noformat}
> And on same setup the system did saw a rollback in segment node store
> {noformat}
> -rw-rw-r--. 1 crx crx 25961984 Feb 4 16:47 data01357a.tar
> -rw-rw-r--. 1 crx crx 24385536 Feb 4 16:41 data01357a.tar.bak
> -rw-rw-r--. 1 crx crx 359936 Feb 4 17:18 data01358a.tar
> -rw-rw-r--. 1 crx crx 345088 Feb 4 17:17 data01358a.tar.bak
> -rw-rw-r--. 1 crx crx 70582272 Feb 4 18:35 data01359a.tar
> -rw-rw-r--. 1 crx crx 66359296 Feb 4 18:33 data01359a.tar.bak
> -rw-rw-r--. 1 crx crx 282112 Feb 4 18:46 data01360a.tar
> -rw-rw-r--. 1 crx crx 236544 Feb 4 18:45 data01360a.tar.bak
> -rw-rw-r--. 1 crx crx 138240 Feb 4 18:56 data01361a.tar
> {noformat}
> So one possible cause is that
> # At some time earlier to 17:17 lucene index got updated and __2ala.cfs_ got
> created.
> # Post update the head revision in Segment store was updated but the revision
> yet to made it to journal log
> # Lucene CopyOnRead logic got event for the change and copied the file
> # System crashed and hence journal did not got updated
> # System restarted and per last entry in journal system suffered with some
> "data loss" and hence index checkpoint also moved back
> # As checkpoint got reverted index started at earlier state and hence created
> a file with same name __2ala.cfs_
> # CopyOnRead detected file length change and logged a warning routing call to
> remote
> # However other files like _2ala.si, _2ala.cfe which were created in same
> commit had same size but likely different content which later cause lucene
> query to start failing
> In such a case a restart after cleaning the existing index content would have
> brought back the system to normal state.
> So as a fix we would need to come up with some sanity check at time of system
> startup
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)