chenboat opened a new issue #5960:
URL: https://github.com/apache/incubator-pinot/issues/5960


   We found during our testing that Pinot LLC realtime consumer stopped 
ingestion due to out of order delivery of Helix state transition events for 
consecutive segments of a partition. The detailed log is attached.
   
   Note that **OFFLINE->CONSUMING** for 
_terra_lever_events__0__88__20200901T2252Z_ was delivered slightly before 
**CONSUMING->ONLINE** for _terra_lever_events__0__87__20200901T1652Z_. 
   
   For  _terra_lever_events__0__88__20200901T2252Z_ to get to the consuming 
state, HelixTaskExecutor needs to [acquire the 
semaphore](https://github.com/apache/incubator-pinot/blob/6dd54f8aee0678dca0c9fb6653a7f894158996ca/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L1189)
 for the partition. However the semaphore is hold by the 
_terra_lever_events__0__87__20200901T1652Z_ which can not be released until 
HelixTaskExecutor executes its CONSUMING->ONLINE transition (which is next in 
the task queue). As a result, a deadlock occurred and the LLC stop ingestions. 
In fact the server stopped ingestion for all tables because HelixTaskExecutor 
has java.util.concurrent.ThreadPoolExecutor default pool size = 1 and active 
threads = 1.   
   
   ```
   2020-09-01 22:52:03.795 [terra_lever_events__0__87__20200901T1652Z] INFO  
org.apache.pinot.common.utils.FileUploadDownloadClient  - Sending request: 
http://streampinot-canary01-dca1:7936/segmentConsumed?name=terra_lever_events__0__87__20200901T1652Z&offset=150641945&instance=Server_streampinot-canary04-dca1_7090&reason=timeLimit&memoryUsedBytes=63737644&rowCount=265476
 to controller: localhost, version: Unknown
   2020-09-01 22:52:03.795 [terra_lever_events__0__87__20200901T1652Z] INFO  
o.a.p.server.realtime.ServerSegmentCompletionProtocolHandler  - Controller 
response 
{"buildTimeSec":-1,"isSplitCommitType":false,"status":"HOLD","offset":150641945}
 for 
http://streampinot-canary01-dca1:7936/segmentConsumed?name=terra_lever_events__0__87__20200901T1652Z&offset=150641945&instance=Server_streampinot-canary04-dca1_7090&reason=timeLimit&memoryUsedBytes=63737644&rowCount=265476
   2020-09-01 22:52:06.801 [terra_lever_events__0__87__20200901T1652Z] INFO  
org.apache.pinot.common.utils.FileUploadDownloadClient  - Sending request: 
http://streampinot-canary01-dca1:7936/segmentConsumed?name=terra_lever_events__0__87__20200901T1652Z&offset=150641945&instance=Server_streampinot-canary04-dca1_7090&reason=timeLimit&memoryUsedBytes=63737644&rowCount=265476
 to controller: localhost, version: Unknown
   2020-09-01 22:52:06.801 [terra_lever_events__0__87__20200901T1652Z] INFO  
o.a.p.server.realtime.ServerSegmentCompletionProtocolHandler  - Controller 
response 
{"buildTimeSec":-1,"isSplitCommitType":false,"status":"DISCARD","offset":-1} 
for 
http://streampinot-canary01-dca1:7936/segmentConsumed?name=terra_lever_events__0__87__20200901T1652Z&offset=150641945&instance=Server_streampinot-canary04-dca1_7090&reason=timeLimit&memoryUsedBytes=63737644&rowCount=265476
   2020-09-01 22:52:06.856 
[ZkClient-EventThread-66-streampinot-stgzk01-dca1,streampinot-stgzk02-dca1,streampinot-stgzk03-dca1/pinot/pinot-canary]
 INFO  org.apache.helix.messaging.handling.HelixTaskExecutor  - Scheduling 
message a195fc82-b406-4de4-9008-7d35a901aaf7: 
terra_lever_events_REALTIME:terra_lever_events__0__88__20200901T2252Z, 
OFFLINE->CONSUMING
   2020-09-01 22:52:06.856 
[ZkClient-EventThread-66-streampinot-stgzk01-dca1,streampinot-stgzk02-dca1,streampinot-stgzk03-dca1/pinot/pinot-canary]
 INFO  org.apache.helix.messaging.handling.HelixTaskExecutor  - Scheduling 
message 5c856925-0bb0-4422-8f3b-ab8fffe26e21: 
terra_lever_events_REALTIME:terra_lever_events__0__87__20200901T1652Z, 
CONSUMING->ONLINE
   2020-09-01 22:52:06.857 [HelixTaskExecutor-message_handle_thread] INFO  
o.a.helix.messaging.handling.HelixStateTransitionHandler  - handling message: 
a195fc82-b406-4de4-9008-7d35a901aaf7 transit 
terra_lever_events_REALTIME.terra_lever_events__0__88__20200901T2252Z|[] 
from:OFFLINE to:CONSUMING, relayedFrom: null
   2020-09-01 22:52:06.865 [HelixTaskExecutor-message_handle_thread] INFO  
o.a.helix.messaging.handling.HelixStateTransitionHandler  - Instance 
Server_streampinot-canary04-dca1_7090, partition 
terra_lever_events__0__88__20200901T2252Z received state transition from 
OFFLINE to CONSUMING on session 36b963066e477de, message id: 
a195fc82-b406-4de4-9008-7d35a901aaf7
   2020-09-01 22:52:06.865 [HelixTaskExecutor-message_handle_thread] INFO  
S.a.p.s.s.h.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel
  - SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline() : 
ZnRecord=a195fc82-b406-4de4-9008-7d35a901aaf7, {CREATE_TIMESTAMP=1599000726835, 
ClusterEventName=IdealStateChange, EXECUTE_START_TIMESTAMP=1599000726857, 
EXE_SESSION_ID=36b963066e477de, FROM_STATE=OFFLINE, 
MSG_ID=a195fc82-b406-4de4-9008-7d35a901aaf7, MSG_STATE=read, 
MSG_TYPE=STATE_TRANSITION, 
PARTITION_NAME=terra_lever_events__0__88__20200901T2252Z, 
READ_TIMESTAMP=1599000726851, RESOURCE_NAME=terra_lever_events_REALTIME, 
RESOURCE_TAG=terra_lever_events_REALTIME, 
SRC_NAME=streampinot-canary01-dca1_7936, SRC_SESSION_ID=26b96306664764b, 
STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, 
STATE_MODEL_FACTORY_NAME=DEFAULT, 
TGT_NAME=Server_streampinot-canary04-dca1_7090, TGT_SESSION_ID=36b963066e477de, 
TO_STATE=CONSUMING}{}{}, Stat=Stat {_version=0, _creationTime=15990007268
 43, _modifiedTime=1599000726843, _ephemeralOwner=0}
   2020-09-01 22:52:06.865 [HelixTaskExecutor-message_handle_thread] INFO  
S.a.p.s.s.h.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel
  - SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline() : 
ZnRecord=a195fc82-b406-4de4-9008-7d35a901aaf7, {CREATE_TIMESTAMP=1599000726835, 
ClusterEventName=IdealStateChange, EXECUTE_START_TIMESTAMP=1599000726857, 
EXE_SESSION_ID=36b963066e477de, FROM_STATE=OFFLINE, 
MSG_ID=a195fc82-b406-4de4-9008-7d35a901aaf7, MSG_STATE=read, 
MSG_TYPE=STATE_TRANSITION, 
PARTITION_NAME=terra_lever_events__0__88__20200901T2252Z, 
READ_TIMESTAMP=1599000726851, RESOURCE_NAME=terra_lever_events_REALTIME, 
RESOURCE_TAG=terra_lever_events_REALTIME, 
SRC_NAME=streampinot-canary01-dca1_7936, SRC_SESSION_ID=26b96306664764b, 
STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, 
STATE_MODEL_FACTORY_NAME=DEFAULT, 
TGT_NAME=Server_streampinot-canary04-dca1_7090, TGT_SESSION_ID=36b963066e477de, 
TO_STATE=CONSUMING}{}{}, Stat=Stat {_version=0, _creationTime=1599000726843,
  _modifiedTime=1599000726843, _ephemeralOwner=0}
   2020-09-01 22:52:06.865 [HelixTaskExecutor-message_handle_thread] INFO  
o.apache.pinot.server.starter.helix.HelixInstanceDataManager  - Adding segment: 
terra_lever_events__0__88__20200901T2252Z to table: terra_lever_events_REALTIME
   2020-09-01 22:52:06.878 [HelixTaskExecutor-message_handle_thread] INFO  
o.a.p.c.d.m.r.LLRealtimeSegmentDataManager_terra_lever_events__0__88__20200901T2252Z
  - RealtimeDataResourceZKMetadata contains no information about sorted column 
for segment terra_lever_events__0__88__20200901
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to