mikemccand opened a new issue, #15552: URL: https://github.com/apache/lucene/issues/15552
### Description

Lucene uses CRC32 for its end-to-end checksumming. Every index file records its own checksum on write, and when lighting a new segment in `IndexReader` we also validate the checksum when we can. `CheckIndex` validates all files, and `IndexWriter` validates source segment files when merging. It's awesome: it catches insidious bit flips for [those people still not using ECC RAM](https://www.reddit.com/r/hardware/comments/y5yroa/linus_tolvards_is_upgrading_his_computer_with_ecc/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button). I know at least @rmuir and myself and maybe @uschindler (?) have been hit by intermittent bad RAM. Once it strikes you personally, you will never again tolerate non-ECC RAM on your dev boxes hah.

At Amazon customer-facing product search, we validate Lucene's checksum through each step of our near-real-time segment replication (via S3), so we can (hopefully -- there are still technically vulnerabilities if you have an adversarial bit-flipping RAM monster lurking in your box; see visuals from [Gemini](https://gemini.google.com/share/0d38531a04ce) and [Grok](https://grok.com/imagine/post/6dfce282-2288-4c8d-a2da-b4d61ecc0a5a?source=copy_link&platform=ios&t=1dce97118e9c)) prevent any bit flips from metastasizing into our persistent S3 index snapshots. This really matters at Amazon's crazy scale (~10s of PiB replicated per day to/from S3 -- many chances for errant bit flips!). S3 write/read also does its own checksumming, but that won't catch bit flips on the `IndexWriter` node before a file is uploaded, so we use/validate both checksums.

One of the delightful properties of CRC32 (thank you MATH!) is that you can map/reduce it! I.e. slice a large segment (5 GB is Lucene's default max merged segment size) into N chunks, compute the CRC32 of each chunk concurrently, and then use zlib's `crc32_combine` API to merge the N CRC32s into a single CRC32 that matches the whole file's sequential CRC32 checksum. It's quite simple (say [Claude](https://claude.ai/share/e2459acc-a812-4a21-9489-680c50a29471) and [Grok](https://grok.com/share/c2hhcmQtMw_610209f9-b4bd-41fe-9e90-b11f5fae9395)). This would be awesome because then users like us could use chunked upload/download to improve aggregate S3 throughput and lower NRT refresh latency during replication.

But, annoyingly, it looks like the JDK's CRC32 implementation (`java.util.zip.CRC32`) does not expose `crc32_combine`. Maybe Lucene could provide a utility class to make this simple (rough sketches below)? Claude provides a [Java implementation of `crc32_combine`](https://claude.ai/share/e2459acc-a812-4a21-9489-680c50a29471), perhaps it is hallucination free? Or maybe we could use FFM to call zlib's native `crc32_combine` -- but I'm not sure we can rely on zlib always being accessible/visible to the JVM.
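For concreteness, here is what such a utility might look like -- a minimal Java port of zlib's `crc32_combine` (Mark Adler's GF(2) matrix-exponentiation trick). The class and method names here are just made up, and this sketch is untested:

```java
import java.util.zip.CRC32;

/** Hypothetical utility: a straight Java port of zlib's crc32_combine. */
public final class Crc32Combine {

  private static final long CRC32_POLY = 0xEDB88320L; // reflected CRC-32 polynomial

  /** Multiply a GF(2) 32x32 matrix by a 32-bit vector. */
  private static long gf2MatrixTimes(long[] mat, long vec) {
    long sum = 0;
    int i = 0;
    while (vec != 0) {
      if ((vec & 1) != 0) sum ^= mat[i];
      vec >>>= 1;
      i++;
    }
    return sum;
  }

  /** square = mat * mat in GF(2). */
  private static void gf2MatrixSquare(long[] square, long[] mat) {
    for (int i = 0; i < 32; i++) square[i] = gf2MatrixTimes(mat, mat[i]);
  }

  /**
   * Returns CRC32(A + B) given crc1 = CRC32(A), crc2 = CRC32(B), and
   * len2 = length of B in bytes.
   */
  public static long combine(long crc1, long crc2, long len2) {
    if (len2 <= 0) return crc1;
    long[] even = new long[32]; // operator for 2^k zero bits, even k
    long[] odd = new long[32];  // operator for 2^k zero bits, odd k

    // operator for one zero bit
    odd[0] = CRC32_POLY;
    long row = 1;
    for (int n = 1; n < 32; n++) {
      odd[n] = row;
      row <<= 1;
    }

    gf2MatrixSquare(even, odd); // two zero bits
    gf2MatrixSquare(odd, even); // four zero bits

    // Apply len2 zero *bytes* to crc1; the first square in the loop yields
    // the operator for eight zero bits, i.e. one zero byte.
    do {
      gf2MatrixSquare(even, odd);
      if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(even, crc1);
      len2 >>>= 1;
      if (len2 == 0) break;
      gf2MatrixSquare(odd, even);
      if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(odd, crc1);
      len2 >>>= 1;
    } while (len2 != 0);

    return crc1 ^ crc2;
  }

  // Quick self-check: combined CRC must equal the sequential CRC of the whole buffer.
  public static void main(String[] args) {
    byte[] data = "The quick brown fox jumps over the lazy dog".getBytes();
    int split = data.length / 2;

    CRC32 whole = new CRC32();
    whole.update(data);

    CRC32 c1 = new CRC32();
    c1.update(data, 0, split);
    CRC32 c2 = new CRC32();
    c2.update(data, split, data.length - split);

    long combined = combine(c1.getValue(), c2.getValue(), data.length - split);
    System.out.println(combined == whole.getValue()); // expect: true
  }
}
```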

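And a toy sketch of the map/reduce itself, assuming the hypothetical `Crc32Combine` class above: hash N slices concurrently, then fold the per-slice CRCs back together (reading the whole file into RAM here only for brevity):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.CRC32;

public final class ParallelCrc32 {
  /** CRC32 of a whole file, computed chunk-by-chunk in parallel. */
  public static long checksum(Path file, int chunkSize) throws Exception {
    byte[] all = Files.readAllBytes(file); // demo only; real code would use positional reads
    int numChunks = (all.length + chunkSize - 1) / chunkSize;
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, Math.min(numChunks, 8)));
    try {
      List<Future<long[]>> perChunk = new ArrayList<>();
      for (int i = 0; i < numChunks; i++) {
        final int off = i * chunkSize;
        final int len = Math.min(chunkSize, all.length - off);
        perChunk.add(pool.submit(() -> {
          CRC32 crc = new CRC32();
          crc.update(all, off, len);
          return new long[] {crc.getValue(), len}; // (chunk CRC, chunk length)
        }));
      }
      long crc = 0; // CRC32 of the empty prefix
      for (Future<long[]> f : perChunk) {
        long[] r = f.get();
        crc = Crc32Combine.combine(crc, r[0], r[1]); // must fold in file order
      }
      return crc;
    } finally {
      pool.shutdown();
    }
  }
}
```

Note the per-chunk hashing is embarrassingly parallel, but the combine step must fold left-to-right in file order: combining is not commutative since each step depends on the length of the suffix being appended.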