itschrispeck opened a new issue, #11004: URL: https://github.com/apache/pinot/issues/11004
When restarting servers in our clusters running on 0.12+, we observe numerous Helix pending messages due to `Failed to Load LLC Segment`. This is caused by a CRC mismatch between ZK and the starting server.

Initial investigation showed that the 'leader' server commits and updates ZK, while the 'follower' server catches up and builds the segment locally with a different CRC. When a server is restarted, all the segments that it 'followed' fail to load and are redownloaded from our deep store. We've confirmed that startOffset/endOffset match and that the difference between the two segments lies in the `columns.psf` file.

Logs for a segment:
```
2023-06-29T13:45:25-07:00 [host] Adding segment: <segment_name> to table: <table_name>
2023-06-29T13:45:25-07:00 [host] Segment: <segment_name> of table: <table_name> has crc change from: 3828021013 to: 1125725625
2023-06-29T13:45:25-07:00 [host] Failed to load LLC segment: <segment_name>, downloading a new copy
```

This behavior isn't seen on our clusters running on a 0.11 base. Is it possible that some non-determinism was introduced in the segment build process?
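For anyone trying to reproduce the comparison, here is a minimal sketch of checking two copies of a segment file by file to see which files differ. It is not Pinot's own segment CRC calculation; it assumes only that both copies (for example the deep-store copy and the follower's local build) are available as extracted directories on disk. The function names and the paths in the usage note are hypothetical.

```python
import os
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC32 of a file, read in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def segment_file_crcs(segment_dir):
    """Map each file's path (relative to segment_dir) to its CRC32."""
    crcs = {}
    for root, _dirs, files in os.walk(segment_dir):
        for name in files:
            full = os.path.join(root, name)
            crcs[os.path.relpath(full, segment_dir)] = file_crc32(full)
    return crcs


def differing_files(dir_a, dir_b):
    """Return sorted relative paths that are missing on one side or differ in CRC32."""
    a, b = segment_file_crcs(dir_a), segment_file_crcs(dir_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Calling `differing_files("/tmp/segment_from_deep_store", "/tmp/segment_from_follower")` (hypothetical paths) lists the files whose contents differ between the two copies. This only narrows the difference down to a file; it does not explain which bytes inside that file changed.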
