chenboat opened a new issue #4626: Low level realtime consumer (LLC) got into ERROR state due to thread race condition.
URL: https://github.com/apache/incubator-pinot/issues/4626

Recently we observed an LLC realtime consumer get into the ERROR state during data consumption. We are running a mid-April Pinot build (5be8431d6a49), but by code inspection the issue could also occur on the latest code.

High-level timeline:
(1) At the end of the segment completion protocol, the controller asked the server to keep its segment and go online.
(2) The consumer thread T, after receiving the KEEP response, tried to build the segment but got stuck acquiring the semaphore.
(3) The main thread in the server received the OnlineFromConsuming transition message from Helix. It then tried to stop consumer thread T from (2) and waited for 10 minutes, but the consumer thread did not stop because it was waiting for the semaphore. Then, in the RETAINING state, the main thread chose to download the segment and go online.
(4) Now two threads were both trying to write to the final segment directory, which caused a file overwrite ERROR.

The detailed logs are attached below.

Here are some observations about the current code and some fix ideas:

(1) In buildSegmentInternal() of LLRealtimeSegmentDataManager, _shouldStop is not checked after long operations like semaphore acquisition and segment build.
https://github.com/apache/incubator-pinot/blob/c0dbbfc81c1fa0d6be78c1b95448fff96803f0c6/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L673-L681
If the method re-checked _shouldStop after potentially lengthy operations like acquireSemaphore() and buildSegment(), the PartitionConsumer thread could just stop as instructed by the main thread, and there would be no overwriting issue. This fix alone could already resolve the issue.

(2) The main thread chose to download and replace the segment in the RETAINING state; this is not consistent with the comment below.
https://github.com/apache/incubator-pinot/blob/c0dbbfc81c1fa0d6be78c1b95448fff96803f0c6/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L106-L108
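
To illustrate fix idea (1), here is a minimal, self-contained sketch of re-checking a stop flag after each potentially long step. It is not the actual LLRealtimeSegmentDataManager code: the class StopAwareSegmentBuilder, its Result enum, and the buildStep/commitStep callbacks are hypothetical names made up for this example. The sketch also goes one step further than the proposal by polling for the semaphore with a timeout, so that a stop request is noticed while still waiting; that part is an assumption, not something the current code does.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch only; names do not come from the Pinot code base. */
final class StopAwareSegmentBuilder {
    /** Outcome of one build attempt. */
    enum Result { BUILT, ABORTED }

    private final Semaphore buildSemaphore;  // limits concurrent segment builds
    private final AtomicBoolean shouldStop = new AtomicBoolean(false);

    StopAwareSegmentBuilder(Semaphore buildSemaphore) {
        this.buildSemaphore = buildSemaphore;
    }

    /** Called by the state-transition (main) thread to ask the consumer to stop. */
    void stop() {
        shouldStop.set(true);
    }

    /** Called by the consumer thread after the controller replies KEEP. */
    Result buildSegment(Runnable buildStep, Runnable commitStep) {
        try {
            // Poll for the permit so a stop request is seen while waiting.
            while (!buildSemaphore.tryAcquire(100, TimeUnit.MILLISECONDS)) {
                if (shouldStop.get()) {
                    return Result.ABORTED;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt status
            return Result.ABORTED;
        }
        try {
            // Re-check after the potentially long wait for the semaphore.
            if (shouldStop.get()) {
                return Result.ABORTED;
            }
            buildStep.run();
            // Re-check after the potentially long build, before the
            // final segment directory is touched.
            if (shouldStop.get()) {
                return Result.ABORTED;
            }
            commitStep.run();
            return Result.BUILT;
        } finally {
            buildSemaphore.release();
        }
    }
}
```

With this shape, the consumer thread returns soon after stop() is called, so the main thread's wait can finish before it downloads the segment. A stop request that arrives after the last check but before commitStep completes is not covered here; that window would still rely on the main thread waiting for the consumer thread to exit.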
