[
https://issues.apache.org/jira/browse/LUCENE-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962741#comment-13962741
]
Michael McCandless commented on LUCENE-5580:
--------------------------------------------
+1 to verifying the checksum on the fly without reading the file twice, and the
patch looks good.
We could pull that anonymous BufferedChecksumIndexInput subclass out into a named
class (e.g., ForwardOnlySeekingChecksum... or something) so that
CompressingTermVectors could do the same thing? Other non-bulk-copying components
could also use it, e.g. I think when merging postings we read nearly the entire
file already (no actual seeking)...
We can do that in a separate issue.
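To make the forward-only idea concrete, here is a minimal sketch in plain Java. It is not the code from the patch and does not use Lucene's classes: ForwardOnlyChecksumInput and its methods are hypothetical names, and java.util.zip.CRC32 over an InputStream stands in for BufferedChecksumIndexInput. A forward seek is implemented by reading the skipped bytes so they still feed the checksum, and a backward seek is rejected.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

/** Hypothetical sketch: checksums every byte it passes over and only allows forward seeks. */
final class ForwardOnlyChecksumInput {
  private final InputStream in;
  private final CRC32 crc = new CRC32();
  private final byte[] skipBuffer = new byte[1024];
  private long position = 0;

  ForwardOnlyChecksumInput(InputStream in) {
    this.in = in;
  }

  /** Reads exactly len bytes into dest and folds them into the checksum. */
  void readBytes(byte[] dest, int offset, int len) throws IOException {
    int done = 0;
    while (done < len) {
      int n = in.read(dest, offset + done, len - done);
      if (n < 0) {
        throw new EOFException("read past EOF at position " + (position + done));
      }
      crc.update(dest, offset + done, n);
      done += n;
    }
    position += len;
  }

  /** Seeks forward by reading (and checksumming) the skipped bytes; backward seeks fail. */
  void seek(long target) throws IOException {
    if (target < position) {
      throw new IllegalStateException(
          "cannot seek backwards: position=" + position + " target=" + target);
    }
    long remaining = target - position;
    while (remaining > 0) {
      int chunk = (int) Math.min(skipBuffer.length, remaining);
      readBytes(skipBuffer, 0, chunk);
      remaining -= chunk;
    }
  }

  long getFilePointer() {
    return position;
  }

  long getChecksum() {
    return crc.getValue();
  }
}

class ForwardOnlyChecksumDemo {
  public static void main(String[] args) throws IOException {
    byte[] data = new byte[5000];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) i;
    }
    ForwardOnlyChecksumInput input =
        new ForwardOnlyChecksumInput(new ByteArrayInputStream(data));
    byte[] header = new byte[16];
    input.readBytes(header, 0, header.length); // normal read
    input.seek(4000);                          // forward seek, bytes still checksummed
    input.seek(data.length);                   // consume the rest

    CRC32 expected = new CRC32();
    expected.update(data, 0, data.length);
    System.out.println("checksums match: " + (input.getChecksum() == expected.getValue()));
  }
}
```

The point of the sketch is only that a reader which never seeks backwards sees every byte exactly once, so its running checksum covers the whole file by the time it reaches the end.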
> Always verify stored fields' checksum on merge
> ----------------------------------------------
>
> Key: LUCENE-5580
> URL: https://issues.apache.org/jira/browse/LUCENE-5580
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Fix For: 4.8
>
> Attachments: LUCENE-5580.patch
>
>
> I have seen a couple of index corruptions over the last few months, and most of
> them happened on stored fields. The explanation might just be that since
> stored fields usually make up most of the index size, they are simply more
> likely to be corrupted by a hardware/operating-system failure, but it might
> just as well be a sneaky bug on our side.
> Lucene recently added checksums to index files, and you can enable integrity
> verification upon merge, but this comes with a cost since you need to read
> all index files twice instead of once. If you are merging a very large
> segment and your merges are I/O-bound, this might be noticeable.
> I would like to implement integrity checks for stored fields on merges on the
> fly, so that the stored fields files need to be read only once.
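The single-pass idea in the description can be sketched as follows, in plain Java rather than Lucene's API. Everything here is an assumption made for illustration: the VerifyingCopy class, its method names, and the file layout (a payload followed by an 8-byte big-endian CRC32 footer) are hypothetical and are not Lucene's actual file format. The copy loop updates the checksum with the same bytes it writes out, then compares against the stored value once the payload has been consumed.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical sketch: copy a payload once and verify its trailing checksum in the same pass. */
final class VerifyingCopy {

  /** Builds the assumed layout: payload bytes followed by an 8-byte big-endian CRC32. */
  static byte[] withFooter(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    return ByteBuffer.allocate(payload.length + 8)
        .put(payload)
        .putLong(crc.getValue())
        .array();
  }

  /** Copies payloadLength bytes to out, checksumming as it goes, then checks the footer. */
  static void copyAndVerify(InputStream in, long payloadLength, OutputStream out)
      throws IOException {
    CRC32 crc = new CRC32();
    byte[] buf = new byte[8192];
    long remaining = payloadLength;
    while (remaining > 0) {
      int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
      if (n < 0) {
        throw new EOFException("truncated payload, " + remaining + " bytes missing");
      }
      crc.update(buf, 0, n); // same bytes feed the checksum...
      out.write(buf, 0, n);  // ...and the copy, so the input is read only once
      remaining -= n;
    }
    long stored = new DataInputStream(in).readLong();
    if (stored != crc.getValue()) {
      throw new IOException(
          "checksum mismatch: stored=" + stored + " computed=" + crc.getValue());
    }
  }
}
```

One consequence of this design is that corruption is only detected after the bytes have already been written, so a caller would need to discard the output when copyAndVerify throws.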