Re: [I] Provide helper utility API to compute CRC32 across concurrent (map/reduce) chunks [lucene]

via GitHub Tue, 20 Jan 2026 06:39:52 -0800


mikemccand commented on issue #15552:
URL: https://github.com/apache/lucene/issues/15552#issuecomment-3773216676


   > Also related to this, at some point we should double-check the math. IIRC 
these algorithms are only appropriate for file sizes up to some certain limit 
(e.g. some number of GB). Otherwise they may not detect problems reliably.
   
   I tried to [ask Claude Opus 4.5 about 
this](https://claude.ai/share/91bf5b0b-3d76-4fbc-aad4-4893cc0168e8) but its 
response is confusing :)  I think the problem is "false negative" risk: bit(s) 
flipped, but the bit-flipped file has the same checksum as the original file 
("checksum collision"), so the error goes undetected.  The risk of this is 
~1 in 4.3 billion (2^32), assuming your bit flips have no accidental (or 
adversarial -- CRC32 is not a secure/crypto hash) correlation with CRC32's 
collisions.
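
   A minimal sketch of the detection case, using the JDK's `java.util.zip.CRC32` on an in-memory byte array rather than Lucene's own checksum code. The class name `BitFlipDemo` and its helpers are hypothetical and exist only for this illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class BitFlipDemo {
    // Compute the CRC32 of a whole byte array with the JDK implementation.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Return a copy of data with exactly one bit inverted (bitIndex counts from 0).
    static byte[] flipBit(byte[] data, int bitIndex) {
        byte[] copy = data.clone();
        copy[bitIndex / 8] ^= (byte) (1 << (bitIndex % 8));
        return copy;
    }

    public static void main(String[] args) {
        // Stand-in payload; a real check would stream the file's bytes instead.
        byte[] original = "pretend these bytes are an index file".getBytes(StandardCharsets.UTF_8);
        byte[] corrupted = flipBit(original, 13);

        long expected = crc32(original);
        long actual = crc32(corrupted);
        System.out.printf("original=%08x corrupted=%08x detected=%b%n",
                expected, actual, expected != actual);
    }
}
```

   A single flipped bit always changes the CRC32 value, so this particular case is guaranteed to be caught. The ~1 in 2^32 figure is the idealized estimate for corruption that behaves like random data.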
   
   But this is a risk per checksum validation, right, not "per GB of file you 
are checksumming" or so?  So, as long as Lucene isn't using billions of files 
in an index, the risk remains lowish for any single Lucene user?
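
   To put rough numbers on the "per validation" reading: if each validation of a corrupted file independently misses with probability p = 2^-32, the chance of at least one miss across n such validations is 1 - (1 - p)^n. The sketch below assumes that idealized model, which is an assumption and not a measured property of real bit-flippers. `CollisionOdds` and `atLeastOneMiss` are hypothetical names, not Lucene API:

```java
final class CollisionOdds {
    // Assumed model: every validation of a corrupted file collides
    // independently with probability 2^-32.
    static final double P_MISS = Math.pow(2.0, -32);

    // Probability that at least one of n corrupted-file validations goes
    // undetected: 1 - (1 - p)^n, computed with log1p/expm1 so the tiny p
    // is not lost to floating-point rounding.
    static double atLeastOneMiss(long corruptedValidations) {
        return -Math.expm1(corruptedValidations * Math.log1p(-P_MISS));
    }

    public static void main(String[] args) {
        for (long n : new long[] {1L, 1_000_000L, 1L << 32}) {
            System.out.printf("n=%d -> %.3e%n", n, atLeastOneMiss(n));
        }
    }
}
```

   Under this model the file size never appears, only the count of corrupted files that get validated. It takes on the order of 2^32 such validations before a miss becomes more likely than not (about 63% at n = 2^32).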
   
   So, of the billions of Lucene users ;) (well, every individual index file 
written and read, summed over all Lucene usage since we added checksums), 
some have probably hit this false negative!  Oh wait, the universe is 
smaller -- it's only those segment files written with a bit-flipper in the 
path?
   
   Still, for such users, it's likely their bit-flipper (wherever it is -- RAM, 
bus, storage, CPU cache lines) will still be detected even if they unluckily 
hit the jackpot once (checksum collision).  Weird/scary/hard to think about...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


