itschrispeck opened a new issue, #11004:
URL: https://github.com/apache/pinot/issues/11004

   When restarting servers in our clusters running on 0.12+, we observe 
numerous pending Helix messages due to `Failed to Load LLC Segment`. This is 
caused by a CRC mismatch between ZK and the starting server. 
   
   Initial investigation showed that the 'leader' server commits the segment 
and updates ZK, while the 'follower' server catches up and builds the segment 
locally with a different CRC. When a server is restarted, all the segments that 
it 'followed' fail to load and are redownloaded from our deep store. We've 
confirmed that startOffset/endOffset match and that the difference between the 
two segments lies in the `columns.psf` file.
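
   For anyone who wants to repeat the comparison, here is a minimal sketch of 
a per-file CRC32 diff between two copies of a segment directory (for example, 
the copy from the deep store and the follower's local build). This is a generic 
diagnostic and does not reproduce how Pinot computes the segment CRC; the 
function names `file_crc32` and `diff_segment_dirs` are hypothetical and not 
part of Pinot.

```python
import zlib
from pathlib import Path


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC32 of a file's bytes as an unsigned 32-bit integer."""
    crc = 0
    with open(path, "rb") as f:
        # Read in chunks so large index files do not need to fit in memory.
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def diff_segment_dirs(dir_a, dir_b):
    """Map relative path -> (crc_a, crc_b) for every file whose CRC32 differs.

    A file present in only one directory is reported with None on the other side.
    """
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    files_a, files_b = list_files(dir_a), list_files(dir_b)
    diffs = {}
    for rel in sorted(files_a | files_b):
        crc_a = file_crc32(dir_a / rel) if rel in files_a else None
        crc_b = file_crc32(dir_b / rel) if rel in files_b else None
        if crc_a != crc_b:
            diffs[rel] = (crc_a, crc_b)
    return diffs
```

   Calling `diff_segment_dirs(leader_copy, follower_copy)` on two untarred 
copies of the same segment returns only the files that differ, which is how a 
mismatch can be narrowed down to a single file such as `columns.psf`.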
   
   Logs for a segment: 
   ```
   2023-06-29T13:45:25-07:00 [host] Adding segment: <segment_name> to table: 
<table_name>
   2023-06-29T13:45:25-07:00 [host] Segment: <segment_name> of table: 
<table_name> has crc change from: 3828021013 to: 1125725625
   2023-06-29T13:45:25-07:00 [host] Failed to load LLC segment: <segment_name>, 
downloading a new copy
   ```
   
   This behavior isn't seen on our clusters running on a 0.11 base. Is it 
possible that some non-determinism was introduced in the segment build process? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

