tibrewalpratik17 commented on issue #13140: URL: https://github.com/apache/pinot/issues/13140#issuecomment-2120131734
One potential root cause might be that the `_partitionGroupConsumerSemaphore` is released before the segment actually comes `ONLINE`, which allows consumption to start on the new segment too early. For example:
- Segment S1 was in CONSUMING state and is now being moved to ONLINE.
- Segment S2 is the new segment waiting to acquire the `_partitionGroupConsumerSemaphore` -- https://github.com/apache/pinot/blob/a385e28c3d8f5175eaca621f59debc2d8f83ab56/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeSegmentDataManager.java#L1563
- S1 starts building the segment and releases the `_partitionGroupConsumerSemaphore` https://github.com/apache/pinot/blob/a385e28c3d8f5175eaca621f59debc2d8f83ab56/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeSegmentDataManager.java#L964-L965 or it goes through the `downloadSegmentAndReplace` step, which again releases the semaphore first https://github.com/apache/pinot/blob/a385e28c3d8f5175eaca621f59debc2d8f83ab56/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeSegmentDataManager.java#L1320-L1322. In both scenarios, S2 starts consuming while S1 is still being built on another thread.
- In the particular issue reported here, what we observed is that the `merge` function in S2 was treating the incoming record as a new record instead of picking up the changes that had already arrived in S1. This is likely because those records never landed in S1: its consumption was stopped before it caught up to the committing offset. Note that the ingestion throughput here is quite high (15k msgs/sec), and all the records for a given key arrive within a very small window of time. So if S1 did not catch up to the first offset of a given key K1 on one of the replicas, and that replica's segment starts getting replaced, S2 on the same replica will treat K1 as a new key and persist it.
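To make the ordering concrete, here is a minimal, hypothetical simulation of the race described above. This is *not* Pinot's actual code: the class, method names, and the single-permit `partitionSemaphore` are illustrative stand-ins for `_partitionGroupConsumerSemaphore`, and the steps are serialized into one thread purely to show the event order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;

// Hypothetical simulation: S1 releases the per-partition semaphore before its
// segment is ONLINE, so S2 can begin consuming while S1 is still being built.
public class SemaphoreOrderingSketch {
    static List<String> run() throws InterruptedException {
        List<String> events = new ArrayList<>();
        Semaphore partitionSemaphore = new Semaphore(1);

        // S1 holds the semaphore while consuming.
        partitionSemaphore.acquire();
        events.add("S1: consuming");

        // Current behavior: the semaphore is released before the build
        // (or downloadSegmentAndReplace) completes.
        partitionSemaphore.release();
        events.add("S1: semaphore released");

        // S2 (the new consuming segment) can now acquire the semaphore and
        // start consuming, even though S1 is not yet ONLINE.
        partitionSemaphore.acquire();
        events.add("S2: consuming");

        // Only now does S1 finish building and transition to ONLINE.
        events.add("S1: built and ONLINE");
        partitionSemaphore.release();
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        // "S2: consuming" appears before "S1: built and ONLINE" -- the window
        // in which S2 can treat a key as new even though S1 saw it first.
        System.out.println(run());
    }
}
```

The key observation is simply that `"S2: consuming"` precedes `"S1: built and ONLINE"` in the event log, which is the window where a partial-upsert `merge` can mistake an existing key for a new one.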
S1 will now report those keys as `not_replaced`, which matches what we were seeing in our metrics.

Question to @Jackie-Jiang and @klsince: can we move the `closeStreamConsumers` call to after building the consuming segment? Do you know why we went with the current logic in the first place?

Note: This will not affect realtime and full-upsert cases to a large extent, but it becomes very critical for partial-upsert cases, where replicas will diverge over time.
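For comparison, a hypothetical sketch of the proposed ordering: hold the semaphore across the build (or download-and-replace) step and release it only once the segment is ONLINE, so the next consuming segment's `acquire()` cannot succeed early. Again, the names (`buildAndRelease`, `partitionSemaphore`) are illustrative, not the actual Pinot API.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the proposed fix: release the per-partition
// semaphore only AFTER the segment replacement is complete.
public class ProposedOrderingSketch {
    static String buildAndRelease(Semaphore partitionSemaphore) {
        // 1. Stop consuming, but KEEP the semaphore held.
        // 2. Build the segment (or run the download-and-replace step).
        String state = "ONLINE";
        // 3. Release only once the segment is ONLINE; the next segment's
        //    acquire() blocks until this point.
        partitionSemaphore.release();
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore sem = new Semaphore(1);
        sem.acquire();                          // S1 consuming
        String state = buildAndRelease(sem);    // S1 commits, then releases
        sem.acquire();                          // S2 can only start here
        System.out.println("S1 was " + state + " before S2 started");
    }
}
```

Under this ordering S2 never observes a partially replaced segment, at the cost of keeping the partition's consumption paused for the duration of the build.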
