[ 
https://issues.apache.org/jira/browse/LUCENE-5842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071686#comment-14071686
 ] 

Robert Muir commented on LUCENE-5842:
-------------------------------------

By the way, as a followup, we can do even better and iterate a bit more:

Today each file by itself can be 'correct', but you can still have a corrupt 
index because the files are mismatched somehow (network replication, or some 
other bug).

It might be worth thinking about reviving segmentinfo.attributes (that's 
cleanest, I think), or putting it in the files map directly (which would be 
harder, as it enforces that files have checksums). We could store each file's 
checksum there, and when we retrieve it here, validate it against that 
attribute. This would detect mismatched files. 
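
A rough sketch of the idea, using only the JDK rather than real Lucene 
classes. Everything here is an assumption for illustration: the class name, 
the "checksum." attribute key prefix, and the plain Map standing in for 
segmentinfo.attributes are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical sketch, not Lucene code: per-file checksums kept as segment attributes. */
public class SegmentChecksumSketch {

  /** CRC32 over the file bytes, standing in for the checksum stored in a file's footer. */
  public static long checksum(byte[] fileBytes) {
    CRC32 crc = new CRC32();
    crc.update(fileBytes, 0, fileBytes.length);
    return crc.getValue();
  }

  /** Write time: record each file's checksum under an assumed "checksum.<name>" key. */
  public static Map<String, String> recordChecksums(Map<String, byte[]> files) {
    Map<String, String> attributes = new HashMap<>();
    for (Map.Entry<String, byte[]> e : files.entrySet()) {
      attributes.put("checksum." + e.getKey(), Long.toString(checksum(e.getValue())));
    }
    return attributes;
  }

  /** Open time: compare the checksum read from the file with the one the segment recorded. */
  public static void validate(String fileName, long retrieved, Map<String, String> attributes) {
    String expected = attributes.get("checksum." + fileName);
    if (expected == null) {
      throw new IllegalStateException("no recorded checksum for " + fileName);
    }
    if (Long.parseLong(expected) != retrieved) {
      throw new IllegalStateException("file does not belong to this segment: " + fileName
          + " (expected=" + expected + ", actual=" + retrieved + ")");
    }
  }

  public static void main(String[] args) {
    Map<String, byte[]> files = new HashMap<>();
    files.put("_0.fdt", "stored fields".getBytes(StandardCharsets.UTF_8));
    Map<String, String> attributes = recordChecksums(files);

    // The same bytes that were written: validation passes silently.
    validate("_0.fdt", checksum(files.get("_0.fdt")), attributes);

    // A structurally valid file from some other copy of the segment: rejected.
    byte[] other = "stored fieldz".getBytes(StandardCharsets.UTF_8);
    try {
      validate("_0.fdt", checksum(other), attributes);
    } catch (IllegalStateException e) {
      System.out.println("mismatch detected: " + e.getMessage());
    }
  }
}
```

The comparison itself is cheap, because it only needs the checksum value 
already retrieved from the footer on init, not a second pass over the file.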

Ideally though we'd do this for the commit too (for deletes and dv updates). 

Anyway, this is just something to explore on another issue, if we can do it 
without creating a mess. I don't like that we can't detect such mismatches 
today (except via very rudimentary checks like livedocs.length == maxdoc etc).


> Validate checksum footers for postings lists, docvalues, storedfields, 
> termvectors on init
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5842
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5842
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-5842.patch
>
>
> For small files (e.g. where we read in all the bytes anyway), we currently 
> validate the checksum on reader init. 
> But for larger files like .doc/.frq/.pos/.dvd/.fdt/.tvd we currently do 
> nothing at all on init, as it would be too expensive.
> We should at least do this:
> {code}
> // NOTE: data file is too costly to verify checksum against all the bytes on 
> // open, but for now we at least verify proper structure of the checksum 
> // footer: which looks for FOOTER_MAGIC + algorithmID. This is cheap 
> // and can detect some forms of corruption such as file truncation.
> CodecUtil.retrieveChecksum(data);
> {code}
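
To illustrate why the structure-only check in the description is cheap but 
still catches truncation, here is a JDK-only sketch. It is not the CodecUtil 
implementation: the class and method names are hypothetical, and the 16-byte 
layout (magic int, algorithm ID int, checksum long) and the constant values 
are assumptions that should be checked against CodecUtil.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical sketch, not Lucene code: validate footer structure without reading the data. */
public class FooterCheckSketch {
  static final int FOOTER_MAGIC = ~0x3fd76c17; // assumed value, for illustration
  static final int ALGORITHM_CRC32 = 0;        // assumed algorithm ID
  static final int FOOTER_LENGTH = 16;         // int magic + int algorithm ID + long checksum

  /** Appends a footer; the CRC32 covers the payload plus the magic and algorithm ID. */
  public static byte[] withFooter(byte[] payload) {
    ByteBuffer buf = ByteBuffer.allocate(payload.length + FOOTER_LENGTH);
    buf.put(payload);
    buf.putInt(FOOTER_MAGIC);
    buf.putInt(ALGORITHM_CRC32);
    CRC32 crc = new CRC32();
    crc.update(buf.array(), 0, payload.length + 8);
    buf.putLong(crc.getValue());
    return buf.array();
  }

  /** Reads only the last 16 bytes: checks magic and algorithm ID, returns the stored checksum. */
  public static long retrieveChecksum(byte[] file) {
    if (file.length < FOOTER_LENGTH) {
      throw new IllegalStateException("file too short for a footer: " + file.length + " bytes");
    }
    ByteBuffer footer = ByteBuffer.wrap(file, file.length - FOOTER_LENGTH, FOOTER_LENGTH);
    int magic = footer.getInt();
    if (magic != FOOTER_MAGIC) {
      throw new IllegalStateException("bad footer magic (truncated file?): " + magic);
    }
    int algorithmId = footer.getInt();
    if (algorithmId != ALGORITHM_CRC32) {
      throw new IllegalStateException("unknown checksum algorithm: " + algorithmId);
    }
    return footer.getLong();
  }

  public static void main(String[] args) {
    byte[] file = withFooter("postings data".getBytes(StandardCharsets.UTF_8));
    System.out.println("stored checksum: " + retrieveChecksum(file));

    // Drop the last 5 bytes: the final 16 bytes no longer start with the magic.
    byte[] truncated = Arrays.copyOf(file, file.length - 5);
    try {
      retrieveChecksum(truncated);
    } catch (IllegalStateException e) {
      System.out.println("corruption detected: " + e.getMessage());
    }
  }
}
```

Corruption in the middle of the data would pass this check, which is why it 
only complements, and does not replace, a full checksum verification.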


