[ 
https://issues.apache.org/jira/browse/OAK-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Mehrotra resolved OAK-4114.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.14

Did a fix with 1769940, 1769939

Now before and Directory is opened for any index a simple sanity check is done 
where local index files are compared against remote in terms of file size. In 
case of any mismatch whole of the local index dir is purged and the files would 
be fetched again later

This check is weak as it compares file size only. But should catch most of such 
cases.

Later we can look into some checksum based approach

> Cached lucene index gets corrupted in case of unclean shutdown and journal 
> rollback in SegmentNodeStore
> -------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-4114
>                 URL: https://issues.apache.org/jira/browse/OAK-4114
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>            Priority: Critical
>              Labels: candidate_oak_1_0, candidate_oak_1_2, candidate_oak_1_4, 
> resilience
>             Fix For: 1.6, 1.5.14
>
>
> Currently Oak Lucene support would copy index files to local file system as 
> part of CopyOnRead feature. In one of the setup it has been observed that 
> index logic was failing with following error
> {noformat}
> 04.02.2016 17:47:52.391 *WARN* [oak-lucene-3] 
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier 
> [/oak:index/lucene] Found local copy for _2ala.cfs in 
> MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>  
> lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>  but size of local 9320 differs from remote 3714150. Content would be read 
> from remote file only
> 04.02.2016 17:47:52.399 *WARN* [oak-lucene-3] 
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier 
> [/oak:index/lucene] Found local copy for segments_28je in 
> MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>  
> lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>  but size of local 1214 differs from remote 1175. Content would be read from 
> remote file only
> 04.02.2016 17:47:52.491 *ERROR* [oak-lucene-3] 
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker Failed to open 
> Lucene index at /oak:index/lucene
> org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual 
> header=1953790076 vs expected header=1071082519 (resource: 
> SlicedIndexInput(SlicedIndexInput(_2ala.fnm in _2ala.cfs) in _2ala.cfs 
> slice=8810:9320))
>       at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128)
>       at 
> org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
>       at 
> org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:215)
> {noformat}
> Here size of __2ala.cfs_ differed from remote copy and possible other index 
> file may have same size but different content. Comparing the modified time of 
> the files with those in Oak it can be seen that one of file system was older 
> than one in Oak
> {noformat}
> _2alr.cfs={name=_2alr.cfs, size=1152402, sizeStr=1.2 MB, modified=Thu Feb 04 
> 17:52:31 GMT 2016, osModified=Feb 4 17:52, osSize=1152402, mismatch=false}
> _2ala.cfe={name=_2ala.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=224, mismatch=false}
> _2ala.si={name=_2ala.si, size=252, sizeStr=252 B, modified=Thu Feb 04 
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=252, mismatch=false}
> _2ala.cfs={name=_2ala.cfs, size=3714150, sizeStr=3.7 MB, modified=Thu Feb 04 
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=9320, mismatch=true}
> _14u3_29.del={name=_14u3_29.del, size=1244036, sizeStr=1.2 MB, modified=Thu 
> Feb 04 16:37:35 GMT 2016, osModified=Feb 4 16:37, osSize=1244036, 
> mismatch=false}
> _2akw.si={name=_2akw.si, size=252, sizeStr=252 B, modified=Thu Feb 04 
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=252, mismatch=false}
> _2akw.cfe={name=_2akw.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=224, mismatch=false}
> _2akw.cfs={name=_2akw.cfs, size=4952761, sizeStr=5.0 MB, modified=Thu Feb 04 
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=4952761, mismatch=false}
> {noformat}
> And on same setup the system did saw a rollback in segment node store 
> {noformat}
> -rw-rw-r--. 1 crx crx  25961984 Feb  4 16:47 data01357a.tar
> -rw-rw-r--. 1 crx crx  24385536 Feb  4 16:41 data01357a.tar.bak
> -rw-rw-r--. 1 crx crx    359936 Feb  4 17:18 data01358a.tar
> -rw-rw-r--. 1 crx crx    345088 Feb  4 17:17 data01358a.tar.bak
> -rw-rw-r--. 1 crx crx  70582272 Feb  4 18:35 data01359a.tar
> -rw-rw-r--. 1 crx crx  66359296 Feb  4 18:33 data01359a.tar.bak
> -rw-rw-r--. 1 crx crx    282112 Feb  4 18:46 data01360a.tar
> -rw-rw-r--. 1 crx crx    236544 Feb  4 18:45 data01360a.tar.bak
> -rw-rw-r--. 1 crx crx    138240 Feb  4 18:56 data01361a.tar
> {noformat}
> So one possible cause is that 
> # At some time earlier to 17:17 lucene index got updated and __2ala.cfs_ got 
> created. 
> # Post update the head revision in Segment store was updated but the revision 
> yet to made it to journal log
> # Lucene CopyOnRead logic got event for the change and copied the file
> # System crashed and hence journal did not got updated
> # System restarted and per last entry in journal system suffered with some 
> "data loss" and hence index checkpoint also moved back
> # As checkpoint got reverted index started at earlier state and hence created 
> a file with same name __2ala.cfs_ 
> # CopyOnRead detected file length change and logged a warning routing call to 
> remote
> # However other files like _2ala.si, _2ala.cfe which were created in same 
> commit had same size but likely different content which later cause lucene 
> query to start failing
> In such a case a restart after cleaning the existing index content would have 
> brought back the system to normal state.
> So as a fix we would need to come up with some sanity check at time of system 
> startup



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to