mikemccand opened a new issue, #15552:
URL: https://github.com/apache/lucene/issues/15552

   ### Description
   
   Lucene uses CRC32 for its end-to-end checksumming.  Every index file records its own checksum on write, and when opening ("lighting") a new segment in `IndexReader` we also validate it when we can.  `CheckIndex` validates all files, and `IndexWriter` validates the source segments' files when merging.
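   For context, the per-file check amounts to roughly this, using plain `java.util.zip.CRC32` -- just a sketch, not Lucene's actual `CodecUtil` footer handling:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

final class Crc32File {
  // Stream a file's bytes through CRC32; the result is compared against the
  // checksum that was recorded when the file was written.
  static long crc32Of(Path file) throws IOException {
    CRC32 crc = new CRC32();
    byte[] buf = new byte[1 << 16];
    try (InputStream in = Files.newInputStream(file)) {
      int n;
      while ((n = in.read(buf)) != -1) {
        crc.update(buf, 0, n);
      }
    }
    return crc.getValue();
  }
}
```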
   
   It's awesome: it catches insidious bit flips for [those people still not using ECC RAM](https://www.reddit.com/r/hardware/comments/y5yroa/linus_tolvards_is_upgrading_his_computer_with_ecc/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).  I know at least @rmuir and I (and maybe @uschindler?) have been hit by intermittent bad RAM.  Once it strikes you personally you will never again tolerate non-ECC RAM on your dev boxes hah.
   
   At Amazon's customer-facing product search, we validate Lucene's checksums through each step of our near-real-time segment replication (via S3), so we can (hopefully -- technically there are still vulnerabilities if you have an adversarial bit-flipping RAM monster lurking in your box (see visuals from [Gemini](https://gemini.google.com/share/0d38531a04ce) and [Grok](https://grok.com/imagine/post/6dfce282-2288-4c8d-a2da-b4d61ecc0a5a?source=copy_link&platform=ios&t=1dce97118e9c))) prevent any bit flips from metastasizing into our persistent S3 index snapshots.  This really matters at Amazon's crazy scale (~10s of PiB replicated per day to/from S3 -- many chances for errant bit flips!).  S3 reads/writes also do their own checksumming, but that won't catch bit flips on the `IndexWriter` node before a file is uploaded, so we use/validate both checksums.
   
   One of the delightful properties of CRC32 (thank you MATH!) is that you can map/reduce it!
   
   I.e., slice up a large segment (5 GB is Lucene's default max merged segment size) into N chunks, compute the CRC32 of each chunk concurrently, and then use zlib's `crc32_combine` API to merge the N CRC32s into a single CRC32 that matches the whole file's sequential CRC32 checksum.  It's quite simple (say [Claude](https://claude.ai/share/e2459acc-a812-4a21-9489-680c50a29471) and [Grok](https://grok.com/share/c2hhcmQtMw_610209f9-b4bd-41fe-9e90-b11f5fae9395)).  This would be awesome because then users like us could use chunked upload/download to improve aggregate S3 throughput and lower NRT refresh latency during replication.
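   For the curious, here is roughly what zlib's `crc32_combine` looks like ported to Java -- a minimal sketch of the standard GF(2) matrix trick; the `Crc32Combine` name and `combine(crc1, crc2, len2)` signature are just placeholders, not vetted or proposed Lucene API:

```java
final class Crc32Combine {

  /** Multiplies a 32x32 GF(2) matrix (stored as 32 row longs) by a 32-bit vector. */
  private static long gf2MatrixTimes(long[] mat, long vec) {
    long sum = 0;
    int i = 0;
    while (vec != 0) {
      if ((vec & 1) != 0) {
        sum ^= mat[i];
      }
      vec >>>= 1;
      i++;
    }
    return sum;
  }

  /** Squares a GF(2) matrix: square = mat * mat. */
  private static void gf2MatrixSquare(long[] square, long[] mat) {
    for (int n = 0; n < 32; n++) {
      square[n] = gf2MatrixTimes(mat, mat[n]);
    }
  }

  /**
   * Combines crc1 (CRC32 of the first block) and crc2 (CRC32 of the second block,
   * which is len2 bytes long) into the CRC32 of the two blocks concatenated.
   */
  static long combine(long crc1, long crc2, long len2) {
    crc1 &= 0xFFFFFFFFL;
    crc2 &= 0xFFFFFFFFL;
    if (len2 <= 0) {
      return crc1;  // appending an empty block changes nothing
    }

    long[] even = new long[32];  // "even power of two zero bits" operator
    long[] odd = new long[32];   // "odd power of two zero bits" operator

    // Operator for one zero bit: reflected CRC-32 polynomial, then shifted identity.
    odd[0] = 0xEDB88320L;
    long row = 1;
    for (int n = 1; n < 32; n++) {
      odd[n] = row;
      row <<= 1;
    }

    gf2MatrixSquare(even, odd);  // operator for two zero bits
    gf2MatrixSquare(odd, even);  // operator for four zero bits

    // Append len2 zero bytes to crc1, one bit of len2 at a time (each squaring
    // doubles the number of zero bits the operator appends).
    do {
      gf2MatrixSquare(even, odd);  // first pass: eight zero bits = one zero byte
      if ((len2 & 1) != 0) {
        crc1 = gf2MatrixTimes(even, crc1);
      }
      len2 >>>= 1;
      if (len2 == 0) {
        break;
      }

      gf2MatrixSquare(odd, even);
      if ((len2 & 1) != 0) {
        crc1 = gf2MatrixTimes(odd, crc1);
      }
      len2 >>>= 1;
    } while (len2 != 0);

    return (crc1 ^ crc2) & 0xFFFFFFFFL;
  }
}
```

   Each combine costs O(log len2) small GF(2) matrix operations, independent of the chunk contents, so folding together even thousands of per-chunk CRCs should be negligible next to hashing the bytes themselves.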
   
   But, annoyingly, it looks like the JDK's CRC32 implementation (`java.util.zip.CRC32`) does not expose a `crc32_combine` equivalent?
   
   Maybe Lucene could provide a utility class to make this simple?  Claude provides a [Java implementation of `crc32_combine`](https://claude.ai/share/e2459acc-a812-4a21-9489-680c50a29471); perhaps it is hallucination-free?  Or maybe we could use FFM to call zlib's `crc32_combine` directly -- but I'm not sure we can rely on zlib always being accessible/visible to the JVM.
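   For illustration, here is how such a utility might be used for the chunked case, assuming the hypothetical `Crc32Combine.combine` sketched above (names and shapes are placeholders):

```java
import java.util.List;
import java.util.zip.CRC32;

final class ChunkedChecksum {
  // Compute each chunk's CRC32 independently (possibly on different threads,
  // or even on different hosts around an S3 multipart transfer), then fold the
  // per-chunk CRCs left-to-right with Crc32Combine.combine.  The result equals
  // the CRC32 of the concatenated chunks, i.e. of the whole file.
  static long checksumOfChunks(List<byte[]> chunks) {
    long[][] perChunk =
        chunks.parallelStream()
            .map(chunk -> {
              CRC32 crc = new CRC32();
              crc.update(chunk, 0, chunk.length);
              return new long[] {crc.getValue(), chunk.length};
            })
            .toArray(long[][]::new);  // encounter order is preserved

    long combined = 0;  // CRC32 of zero bytes is 0
    for (long[] c : perChunk) {
      combined = Crc32Combine.combine(combined, c[0], c[1]);
    }
    return combined;
  }
}
```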

