chenboat opened a new issue #4626: Low level realtime consumer (LLC) got into 
ERROR state due to thread race condition.
URL: https://github.com/apache/incubator-pinot/issues/4626
 
 
   Recently we observed an LLC realtime consumer getting into the ERROR state 
during data consumption. We are running a mid-April build of Pinot 
(5be8431d6a49), but by code inspection the issue could also occur on the 
latest code.
   
    At a high level, the timeline is:
      (1) At the end of the segment completion protocol, the controller asked 
the server to keep its segment and go online. 
      (2) The consumer thread T, after receiving the KEEP response, tried to 
build the segment but got stuck acquiring the semaphore.
      (3) The main thread in the server received the OnlineFromConsuming 
transition message from Helix. It then tried to stop the consumer thread T 
from (2) and waited for 10 minutes, but the consumer thread did not stop 
because it was waiting for the semaphore. Then, in the RETAINING state, the 
main thread chose to download the segment and go online.
      (4) Once T finally acquired the semaphore, two threads were both trying 
to write to the final segment directory, which caused a file overwrite ERROR.
   
   The detailed logs are attached below. Here are some observations about the 
current code and some fix ideas:
   (1) In buildSegmentInternal() of LLRealtimeSegmentDataManager, _shouldStop 
is not checked after long operations such as semaphore acquisition and 
segment build.
   
https://github.com/apache/incubator-pinot/blob/c0dbbfc81c1fa0d6be78c1b95448fff96803f0c6/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L673-L681
   
   If the method re-checked _shouldStop after potentially lengthy operations 
such as acquireSemaphore() and buildSegment(), the PartitionConsumer thread 
could simply stop as instructed by the main thread, and nothing would be 
overwritten. This change alone would likely be enough to fix the issue.
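
   A minimal sketch of the re-check idea follows. It is not the actual 
LLRealtimeSegmentDataManager code: the class StopAwareBuilder, its fields, and 
its methods are hypothetical stand-ins, and the segment build itself is 
reduced to setting a flag. It only illustrates checking the stop flag again 
after each potentially long wait, before touching the final segment directory.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the consumer side; names are assumptions for illustration.
class StopAwareBuilder {
    // Limits how many segment builds may run at once (assumed role of the semaphore).
    private final Semaphore buildPermit;
    // Mirrors the role of _shouldStop: set by another thread to ask the consumer to stop.
    private final AtomicBoolean shouldStop = new AtomicBoolean(false);
    // Records whether this builder wrote to the (simulated) final segment directory.
    private final AtomicBoolean segmentWritten = new AtomicBoolean(false);

    StopAwareBuilder(Semaphore buildPermit) {
        this.buildPermit = buildPermit;
    }

    // Called by the thread that wants the consumer to stop.
    void stop() {
        shouldStop.set(true);
    }

    boolean segmentWritten() {
        return segmentWritten.get();
    }

    // Returns true if the segment was built and written, false if the build
    // was abandoned because a stop was requested in the meantime.
    boolean buildSegment() {
        buildPermit.acquireUninterruptibly(); // may block for a long time
        try {
            // Re-check after the wait: the stop request may have arrived while blocked.
            if (shouldStop.get()) {
                return false;
            }
            // ... build the segment into a temporary location (omitted) ...

            // Re-check after the build, before writing to the final directory.
            if (shouldStop.get()) {
                return false;
            }
            segmentWritten.set(true); // stands in for the move to the final directory
            return true;
        } finally {
            buildPermit.release();
        }
    }
}
```

   In this sketch, a builder whose stop() was called while it waited for the 
permit returns false without writing anything, so only the thread that 
requested the stop goes on to populate the final directory.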
   
   (2) The main thread chose to download and replace the segment in the 
RETAINING state, which is not consistent with the code comment linked below. 
   
   
https://github.com/apache/incubator-pinot/blob/c0dbbfc81c1fa0d6be78c1b95448fff96803f0c6/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L106-L108
   
   
   
   
