anuragrai16 opened a new pull request, #17264: URL: https://github.com/apache/pinot/pull/17264
This PR introduces a `dataCrc` field that captures a CRC checksum computed only over core data files (forward indices, dictionaries, inverted indices, and `metadata.properties`), excluding auxiliary indexes such as star-tree, text indexes, and bloom filters. This makes it possible to tell actual data changes apart from index-only changes, which is critical for upsert table consistency validation and replication verification.

Note that this PR only adds computation of the data-only CRC, plus the changes for reading it and pushing it to ZK. The CRC is not yet used anywhere to perform segment comparison; those changes will be added in the next PR.

## Motivation

https://github.com/apache/pinot/issues/17262

**Code Changes**

`CrcUtils.java`: Added logic to identify and separately track data files vs. all files, supporting `computeCrc(boolean dataOnly)` to compute either the data-only or the full CRC.

`SegmentIndexCreationDriverImpl.java`: Modified segment creation to compute the data CRC before format conversion and the full CRC after all indexes are built. Updated `persistCreationMeta()` to write three long values: `[crc, creationTime, dataCrc]`.

`SegmentMetadataImpl.java`: Added a `_dataCrc` field and modified `loadCreationMeta()` to read the third long value with backward-compatibility handling (it catches `IOException` for old segments without `dataCrc`).

`SegmentZKMetadataUtils.java`: Already publishes `dataCrc` to ZooKeeper (existing code at line 163), ensuring all segment uploads include the new field.

**Backward Compatibility**

Backward compatible. Old segments without `dataCrc` default to `Long.MIN_VALUE` (displayed as -1 in ZK), while new segments include the computed value. All segment creation and upload paths have been verified.

**Testing**
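
For readers following the creation-meta change described under Code Changes, here is a minimal sketch of the three-long layout and the backward-compatible read. It is not the code from this PR: the class `CreationMetaSketch` and its methods `write`, `writeLegacy`, and `read` are hypothetical names, the data is held in a byte array rather than a segment file, and the fallback catches `EOFException` (a subclass of the `IOException` the description mentions) on the assumption that an old segment simply ends after the second long.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; these are not Pinot's classes or signatures.
final class CreationMetaSketch {
  // Sentinel for "no dataCrc recorded", the default named in the PR description.
  static final long MISSING_DATA_CRC = Long.MIN_VALUE;

  // New layout: [crc, creationTime, dataCrc], each written as a big-endian long.
  static byte[] write(long crc, long creationTime, long dataCrc) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeLong(crc);
      out.writeLong(creationTime);
      out.writeLong(dataCrc);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Old layout: [crc, creationTime] only, as an older segment would contain.
  static byte[] writeLegacy(long crc, long creationTime) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeLong(crc);
      out.writeLong(creationTime);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Returns {crc, creationTime, dataCrc}; dataCrc falls back to the sentinel
  // when the content ends before a third long can be read.
  static long[] read(byte[] content) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(content))) {
      long crc = in.readLong();
      long creationTime = in.readLong();
      long dataCrc;
      try {
        dataCrc = in.readLong();
      } catch (EOFException e) {
        dataCrc = MISSING_DATA_CRC; // old segment: third value absent
      }
      return new long[] {crc, creationTime, dataCrc};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Under these assumptions, `CreationMetaSketch.read(CreationMetaSketch.writeLegacy(crc, time))` yields the sentinel in the third slot, while content produced by `write` round-trips all three values. Appending the new value at the end is what lets older content stay readable without a version marker.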
