Hi Robert,

Regarding the Lucene checksum validation introduced in 4.8.0, I think we can try 
that to detect hardware failures at an early stage. Do you have any numbers on 
the performance overhead the checksum validation adds?
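
For reference, below is a minimal sketch (untested; the class name is made up) of 
how the 4.8+ CRC32 footers could be verified for every file in an index 
directory. Only files written by a 4.8+ codec carry the footer, so our existing 
4.5.1 segments would have to be rewritten first:

    import java.io.File;
    import org.apache.lucene.codecs.CodecUtil;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    public class VerifyChecksums {
      public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(new File(args[0]))) {
          for (String file : dir.listAll()) {
            // Skip files that carry no codec footer (lock file, legacy segments.gen).
            if (file.equals("write.lock") || file.equals("segments.gen")) {
              continue;
            }
            try (IndexInput in = dir.openInput(file, IOContext.READONCE)) {
              // Reads the whole file; throws CorruptIndexException on a CRC mismatch.
              CodecUtil.checksumEntireFile(in);
              System.out.println(file + ": checksum OK");
            }
          }
        }
      }
    }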

Also, while browsing Lucene JIRA issues, I found LUCENE-6192, which looks quite 
similar to ours. We also have a single-term field with a high frequency, although 
only one position per doc. In the JIRA comments, Michael mentioned that the skip 
data is regenerated during merging/optimizing:

“The good thing about skip data is it's ignored during merging, so to test this 
you just need to apply the patch, compile & deploy Lucene core JAR, then 
optimize so the skip data is regenerated...”

I just want to confirm with you: if this is the same issue, and I run a 
merge/optimize with 4.10.5, will the newly created segment get freshly written, 
good skip data, so that the corruption is fixed?
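
For example, a rough sketch of forcing such a merge (untested; the exact Version 
constant and analyzer constructor may need adjusting for the 4.10.x build we use):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ForceMergeIndex {
      public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(new File(args[0]))) {
          IndexWriterConfig cfg =
              new IndexWriterConfig(Version.LATEST, new StandardAnalyzer(Version.LATEST));
          try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            // Rewrites all live docs into a single new segment; the postings
            // (including skip data) are regenerated in the process.
            writer.forceMerge(1);
          }
        }
      }
    }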

On 2/2/18, 10:33 PM, "Robert Muir" <rcm...@gmail.com> wrote:

    I agree that it may be a useful test to narrow the problem down.
    
    But given that you have deleted docs, I'm not sure what conclusions
    could be drawn from it, because lots of other changes will happen too
    (e.g. docs, positions, etc will compress differently).
    
    On Fri, Feb 2, 2018 at 9:19 AM, Tony Ma <t...@opentext.com> wrote:
    > Thanks Robert.
    >
    > We are not going to use merge to repair a corrupted index. The issue we are 
seeing is that a segment has already become corrupted, but merges usually run 
automatically in the background. I am trying to find out whether, when this 
scenario occurs, the merge will stop with an exception or complete and produce a 
new corrupted segment.
    >
    > To be specific, we got a corrupted segment with the following CheckIndex 
output:
    >   1 of 5: name=_0 docCount=8341939
    >     codec=Lucene45
    >     compound=false
    >     numFiles=48
    >     size (MB)=16,446.275
    >     diagnostics = {os=Windows Server 2008 R2, java.vendor=Oracle Corporation, java.version=1.7.0_80, lucene.version=4.5.1 1533280 - mark - 2013-10-17 21:37:01, mergeMaxNumSegments=5, os.arch=amd64, source=merge, mergeFactor=6, timestamp=1514627603337, os.version=6.1}
    >     has deletions [delGen=130]
    >     test: open reader.........OK [4022 deleted docs]
    >     test: fields..............OK [268 fields]
    >     test: field norms.........OK [3 fields]
    >     test: terms, freq, prox...ERROR: java.lang.ArrayIndexOutOfBoundsException: 105
    > java.lang.ArrayIndexOutOfBoundsException: 105
    >         at org.apache.lucene.codecs.lucene41.ForUtil.readBlock(ForUtil.java:196)
    >         at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.refillPositions(Lucene41PostingsReader.java:1284)
    >         at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.skipPositions(Lucene41PostingsReader.java:1505)
    >         at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.nextPosition(Lucene41PostingsReader.java:1548)
    >         at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:979)
    >         at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1232)
    >         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:623)
    >         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:372)
    >
    >
    > In CheckIndex, each position is checked first (all pass) and then a skip 
test is run (which fails), so the corruption seems to be in the skip list. I am 
wondering whether, in this special case, a merge could reconstruct a new, good 
skip list, since every position itself is fine.
    >
    > That way I could at least tell whether this segment is a newly corrupted 
one, or whether it was corrupted earlier and then merged into a new one.
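    >
    > For reference, a minimal sketch (untested; the path argument is illustrative) 
of running CheckIndex directly against that directory, so the report can be 
compared before and after the background merge:
    >
    >   import java.io.File;
    >   import org.apache.lucene.index.CheckIndex;
    >   import org.apache.lucene.store.Directory;
    >   import org.apache.lucene.store.FSDirectory;
    >
    >   public class RunCheckIndex {
    >     public static void main(String[] args) throws Exception {
    >       try (Directory dir = FSDirectory.open(new File(args[0]))) {
    >         CheckIndex checker = new CheckIndex(dir);
    >         checker.setInfoStream(System.out, true); // verbose per-segment report
    >         CheckIndex.Status status = checker.checkIndex();
    >         System.out.println("index clean? " + status.clean);
    >       }
    >     }
    >   }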
    >
    >
    > On 2/2/18, 9:58 PM, "Robert Muir" <rcm...@gmail.com> wrote:
    >
    >     IMO this is not something you want to do.
    >
    >     The only remedy CheckIndex has for a corrupted segment is to drop it
    >     completely: and if you choose to do that then you lose all the
    >     documents in that segment. So it's not very useful to merge it with
    >     other segments into bigger corrupted segments since it will just
    >     spread more corruption.
    >
    >     On Fri, Feb 2, 2018 at 3:08 AM, Tony Ma <t...@opentext.com> wrote:
    >     > Hi experts,
    >     >
    >     > A question about corrupted indexes: if an index segment is already 
corrupted, can it still be merged with another segment? Or does that depend on 
where it got corrupted, for example in the .pay file?
    >     >
    >     > From: 马江 <t...@opentext.com>
    >     > Date: Friday, January 19, 2018 at 9:52 AM
    >     > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
    >     > Subject: Re: [EXTERNAL] - Lucene 4.5.1 payload corruption - ArrayIndexOutOfBoundsException
    >     >
    >     > Hi experts,
    >     >
    >     > Still about this issue: is there any known bug which could cause 
payload file corruption? The stack trace indicates that the first byte of the 
input should be an integer <= 32, but it is actually 110.
    >     > Our customers have seen this kind of corruption several times, and all 
of it is in payloads. Is there any possibility that the bytes we put into the 
payload are incompatible with the payload codec?
    >     >
    >     >
    >     >   void readBlock(IndexInput in, byte[] encoded, int[] decoded) throws IOException {
    >     >     final int numBits = in.readByte();
    >     >     assert numBits <= 32 : numBits;
    >     >
    >     >     if (numBits == ALL_VALUES_EQUAL) {
    >     >       final int value = in.readVInt();
    >     >       Arrays.fill(decoded, 0, BLOCK_SIZE, value);
    >     >       return;
    >     >     }
    >     >
    >     >     final int encodedSize = encodedSizes[numBits];
    >     >     in.readBytes(encoded, 0, encodedSize);
    >     >
    >     >
    >     > From: 马江 <t...@opentext.com>
    >     > Reply-To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
    >     > Date: Tuesday, January 16, 2018 at 11:16 AM
    >     > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
    >     > Subject: [EXTERNAL] - Lucene 4.5.1 payload corruption - ArrayIndexOutOfBoundsException
    >     >
    >     > Hi experts,
    >     >
    >     > Recently one of our customers has been continuously seeing an 
ArrayIndexOutOfBoundsException thrown from Lucene.
    >     >
    >     > Our product is a full-text search engine built on top of Lucene; the 
stack trace is below. The customer says they can reproduce the issue even after 
re-indexing everything from scratch.
    >     >
    >     > Caused by: java.lang.ArrayIndexOutOfBoundsException: 110
    >     >                 at org.apache.lucene.codecs.lucene41.ForUtil.readBlock(ForUtil.java:196)
    >     >                 at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.refillPositions(Lucene41PostingsReader.java:1284)
    >     >                 at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.skipPositions(Lucene41PostingsReader.java:1505)
    >     >                 at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.nextPosition(Lucene41PostingsReader.java:1548)
    >     >                 at org.apache.lucene.search.spans.TermSpans.skipTo(TermSpans.java:82)
    >     >                 at org.apache.lucene.search.spans.SpanScorer.advance(SpanScorer.java:63)
    >     >                 at org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:69)
    >     >                 at org.apache.lucene.search.ConjunctionScorer.nextDoc(ConjunctionScorer.java:100)
    >     >                 at org.apache.lucene.search.Scorer.score(Scorer.java:64)
    >     >                 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:627)
    >     >                 at com.xhive.lucene.executor.f.a(xdb:158)
    >     >                 at com.xhive.lucene.executor.f.search(xdb:145)
    >     >                 at com.xhive.lucene.subpath.e.a(xdb:313)
    >     >                 at com.xhive.lucene.subpath.e.a(xdb:264)
    >     >                 at com.xhive.lucene.subpath.e.a(xdb:183)
    >     >                 at com.xhive.lucene.executor.v.executeExternally(xdb:253)
    >     >                 at com.xhive.kernel.ay.externalIndexExecute(xdb:2791)
    >     >                 at com.xhive.core.index.ExternalIndex.executeExternally(xdb:485)
    >     >                 at com.xhive.core.index.XhiveMultiPathIndex.a(xdb:306)
    >     >                 at com.xhive.xquery.pathexpr.v$a.ci(xdb:124)
    >     >                 at com.xhive.xquery.pathexpr.ad$a.cp(xdb:104)
    >     >                 at com.xhive.xquery.pathexpr.ax.awP(xdb:39)
    >     >                 at com.xhive.xquery.pathexpr.ax.<init>(xdb:32)
    >     >                 at com.xhive.xquery.pathexpr.av.a(xdb:424)
    >     >                 at com.xhive.xquery.pathexpr.al$a.awk(xdb:61)
    >     >                 at com.xhive.xquery.pathexpr.ag.awj(xdb:28)
    >     >                 at com.xhive.xquery.pathexpr.al.Xo(xdb:26)
    >     >                 at com.xhive.xquery.pathexpr.aj.<init>(xdb:33)
    >     >                 at com.xhive.xquery.pathexpr.al.<init>(xdb:20)
    >     >                 at com.xhive.xquery.pathexpr.av.a(xdb:462)
    >     >                 at com.xhive.xquery.pathexpr.av.a(xdb:413)
    >     >                 at com.xhive.xquery.pathexpr.av.a(xdb:276)
    >     >                 at com.xhive.xquery.pathexpr.av.a(xdb:220)
    >     >
    >     >
    >     > ==============================================================
    >     > Following is the CheckIndex output for the corrupted segment. The full 
output is attached.
    >     >
    >     >
    >     > Checking consistency of: [CHECK_INDEXES_CONSISTENCY]
    >     > Library child /dpwprd/dsearch/Data/Collection2 is not in consistent state, errors report:
    >     > ============================================================
    >     > Library child name=/dpwprd/dsearch/Data/Collection2 indexes consistency report.
    >     > ============================================================
    >     > check external index consistency [database name: xhivedb; index name: dmftdoc; segment id: EI-0ab89c0c-2a9d-4fe2-97b9-5f0c96678f13-510173395289107-master; xhive index id id: 510173395289107]
    >     > check lucene indices
    >     > fail: lucene index LI-0001cd61-342c-4cfe-9898-c293eb1c8c09 is not consistent; Segments file=segments_2 numSegments=5 version=4.5.1 format=
    >     >   1 of 5: name=_0 docCount=8341939
    >     >     codec=Lucene45
    >     >     compound=false
    >     >     numFiles=26
    >     >     size (MB)=16,446.152
    >     >     diagnostics = {timestamp=1514627603337, mergeFactor=6, os.version=6.1, os=Windows Server 2008 R2, lucene.version=4.5.1 1533280 - mark - 2013-10-17 21:37:01, source=merge, os.arch=amd64, mergeMaxNumSegments=5, java.version=1.7.0_80, java.vendor=Oracle Corporation}
    >     >     has deletions [delGen=70]
    >     >     test: open reader.........OK [2295 deleted docs]
    >     >     test: fields..............OK [268 fields]
    >     >     test: field norms.........OK [3 fields]
    >     >     test: terms, freq, prox...ERROR: java.lang.ArrayIndexOutOfBoundsException
    >     > java.lang.ArrayIndexOutOfBoundsException
    >     >     test: stored fields.......OK [16679288 total field count; avg 2 fields per doc]
    >     >     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
    >     >     test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]
    >     > FAILED
    >     >     WARNING: fixIndex() would remove reference to this segment; full exception:
    >     > java.lang.RuntimeException: Term Index test failed
    >     >                 at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:638)
    >     >                 at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:372)
    >     >                 at com.xhive.lucene.executor.j.a(xdb:1190)
    >     >                 at com.xhive.lucene.executor.j.aY(xdb:1166)
    >     >                 at com.xhive.lucene.executor.v.checkIndexConsistency(xdb:370)
    >     >                 at com.xhive.kernel.ay.externalIndexCheckConsistency(xdb:2523)
    >     >                 at com.xhive.kernel.bn.handleRequest(xdb:2544)
    >     >                 at com.xhive.kernel.bn.run(xdb:222)
    >     >                 at java.lang.Thread.run(Thread.java:745)
    >     >
    >     > ==============================================================
    >     >
    >     > The corrupted payload stores a serialized HashMap containing several 
configurable metadata values that are used for sorting by condition.
    >     > The field with the corrupted payload is a single-term field, so the 
posting structure is essentially a sequence of payloads.
    >     > We also put a freshness boost value into a payload on another field, 
and that one has no issues.
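    >     >
    >     > For illustration, a simplified sketch of how such a map could be 
attached as a payload during analysis (not our actual production code; class and 
field names are made up):
    >     >
    >     >   import java.io.IOException;
    >     >   import org.apache.lucene.analysis.TokenFilter;
    >     >   import org.apache.lucene.analysis.TokenStream;
    >     >   import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    >     >   import org.apache.lucene.util.BytesRef;
    >     >
    >     >   public final class MetadataPayloadFilter extends TokenFilter {
    >     >     private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    >     >     private final byte[] serializedMetadata; // e.g. the serialized HashMap
    >     >
    >     >     public MetadataPayloadFilter(TokenStream in, byte[] serializedMetadata) {
    >     >       super(in);
    >     >       this.serializedMetadata = serializedMetadata;
    >     >     }
    >     >
    >     >     @Override
    >     >     public boolean incrementToken() throws IOException {
    >     >       if (!input.incrementToken()) {
    >     >         return false;
    >     >       }
    >     >       // Attach the opaque bytes to the single token's position.
    >     >       payloadAtt.setPayload(new BytesRef(serializedMetadata));
    >     >       return true;
    >     >     }
    >     >   }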
    >     >
    >     > This is the first customer to report such corruption since we adopted 
Lucene 4.5.1 and released our product many years ago.
    >     >
    >     > Please let me know if you have any idea to this issue.
    >     >
    >     > Thanks,
    >     > Tony Ma(马江)
    >     >
    >