chenboat opened a new issue #5960: URL: https://github.com/apache/incubator-pinot/issues/5960
We found during our testing that Pinot LLC realtime consumer stopped ingestion due to out of order delivery of Helix state transition events for consecutive segments of a partition. The detailed log is attached. Note that **OFFLINE->CONSUMING** for _terra_lever_events__0__88__20200901T2252Z_ was delivered slightly before **CONSUMING->ONLINE** for _terra_lever_events__0__87__20200901T1652Z_. For _terra_lever_events__0__88__20200901T2252Z_ to get to the consuming state, HelixTaskExecutor needs to [acquire the semaphore](https://github.com/apache/incubator-pinot/blob/6dd54f8aee0678dca0c9fb6653a7f894158996ca/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L1189) for the partition. However the semaphore is hold by the _terra_lever_events__0__87__20200901T1652Z_ which can not be released until HelixTaskExecutor executes its CONSUMING->ONLINE transition (which is next in the task queue). As a result, a deadlock occurred and the LLC stop ingestions. In fact the server stopped ingestion for all tables because HelixTaskExecutor has java.util.concurrent.ThreadPoolExecutor default pool size = 1 and active threads = 1. ``` 2020-09-01 22:52:03.795 [terra_lever_events__0__87__20200901T1652Z] INFO org.apache.pinot.common.utils.FileUploadDownloadClient - Sending request: http://streampinot-canary01-dca1:7936/segmentConsumed?name=terra_lever_events__0__87__20200901T1652Z&offset=150641945&instance=Server_streampinot-canary04-dca1_7090&reason=timeLimit&memoryUsedBytes=63737644&rowCount=265476 to controller: localhost, version: Unknown 2020-09-01 22:52:03.795 [terra_lever_events__0__87__20200901T1652Z] INFO o.a.p.server.realtime.ServerSegmentCompletionProtocolHandler - Controller response {"buildTimeSec":-1,"isSplitCommitType":false,"status":"HOLD","offset":150641945} for http://streampinot-canary01-dca1:7936/segmentConsumed?name=terra_lever_events__0__87__20200901T1652Z&offset=150641945&instance=Server_streampinot-canary04-dca1_7090&reason=timeLimit&memoryUsedBytes=63737644&rowCount=265476 2020-09-01 22:52:06.801 [terra_lever_events__0__87__20200901T1652Z] INFO org.apache.pinot.common.utils.FileUploadDownloadClient - Sending request: http://streampinot-canary01-dca1:7936/segmentConsumed?name=terra_lever_events__0__87__20200901T1652Z&offset=150641945&instance=Server_streampinot-canary04-dca1_7090&reason=timeLimit&memoryUsedBytes=63737644&rowCount=265476 to controller: localhost, version: Unknown 2020-09-01 22:52:06.801 [terra_lever_events__0__87__20200901T1652Z] INFO o.a.p.server.realtime.ServerSegmentCompletionProtocolHandler - Controller response {"buildTimeSec":-1,"isSplitCommitType":false,"status":"DISCARD","offset":-1} for http://streampinot-canary01-dca1:7936/segmentConsumed?name=terra_lever_events__0__87__20200901T1652Z&offset=150641945&instance=Server_streampinot-canary04-dca1_7090&reason=timeLimit&memoryUsedBytes=63737644&rowCount=265476 2020-09-01 22:52:06.856 [ZkClient-EventThread-66-streampinot-stgzk01-dca1,streampinot-stgzk02-dca1,streampinot-stgzk03-dca1/pinot/pinot-canary] INFO org.apache.helix.messaging.handling.HelixTaskExecutor - Scheduling message a195fc82-b406-4de4-9008-7d35a901aaf7: terra_lever_events_REALTIME:terra_lever_events__0__88__20200901T2252Z, OFFLINE->CONSUMING 2020-09-01 22:52:06.856 [ZkClient-EventThread-66-streampinot-stgzk01-dca1,streampinot-stgzk02-dca1,streampinot-stgzk03-dca1/pinot/pinot-canary] INFO org.apache.helix.messaging.handling.HelixTaskExecutor - Scheduling message 5c856925-0bb0-4422-8f3b-ab8fffe26e21: terra_lever_events_REALTIME:terra_lever_events__0__87__20200901T1652Z, CONSUMING->ONLINE 2020-09-01 22:52:06.857 [HelixTaskExecutor-message_handle_thread] INFO o.a.helix.messaging.handling.HelixStateTransitionHandler - handling message: a195fc82-b406-4de4-9008-7d35a901aaf7 transit terra_lever_events_REALTIME.terra_lever_events__0__88__20200901T2252Z|[] from:OFFLINE to:CONSUMING, relayedFrom: null 2020-09-01 22:52:06.865 [HelixTaskExecutor-message_handle_thread] INFO o.a.helix.messaging.handling.HelixStateTransitionHandler - Instance Server_streampinot-canary04-dca1_7090, partition terra_lever_events__0__88__20200901T2252Z received state transition from OFFLINE to CONSUMING on session 36b963066e477de, message id: a195fc82-b406-4de4-9008-7d35a901aaf7 2020-09-01 22:52:06.865 [HelixTaskExecutor-message_handle_thread] INFO S.a.p.s.s.h.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel - SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline() : ZnRecord=a195fc82-b406-4de4-9008-7d35a901aaf7, {CREATE_TIMESTAMP=1599000726835, ClusterEventName=IdealStateChange, EXECUTE_START_TIMESTAMP=1599000726857, EXE_SESSION_ID=36b963066e477de, FROM_STATE=OFFLINE, MSG_ID=a195fc82-b406-4de4-9008-7d35a901aaf7, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=terra_lever_events__0__88__20200901T2252Z, READ_TIMESTAMP=1599000726851, RESOURCE_NAME=terra_lever_events_REALTIME, RESOURCE_TAG=terra_lever_events_REALTIME, SRC_NAME=streampinot-canary01-dca1_7936, SRC_SESSION_ID=26b96306664764b, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_streampinot-canary04-dca1_7090, TGT_SESSION_ID=36b963066e477de, TO_STATE=CONSUMING}{}{}, Stat=Stat {_version=0, _creationTime=15990007268 43, _modifiedTime=1599000726843, _ephemeralOwner=0} 2020-09-01 22:52:06.865 [HelixTaskExecutor-message_handle_thread] INFO S.a.p.s.s.h.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel - SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline() : ZnRecord=a195fc82-b406-4de4-9008-7d35a901aaf7, {CREATE_TIMESTAMP=1599000726835, ClusterEventName=IdealStateChange, EXECUTE_START_TIMESTAMP=1599000726857, EXE_SESSION_ID=36b963066e477de, FROM_STATE=OFFLINE, MSG_ID=a195fc82-b406-4de4-9008-7d35a901aaf7, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=terra_lever_events__0__88__20200901T2252Z, READ_TIMESTAMP=1599000726851, RESOURCE_NAME=terra_lever_events_REALTIME, RESOURCE_TAG=terra_lever_events_REALTIME, SRC_NAME=streampinot-canary01-dca1_7936, SRC_SESSION_ID=26b96306664764b, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_streampinot-canary04-dca1_7090, TGT_SESSION_ID=36b963066e477de, TO_STATE=CONSUMING}{}{}, Stat=Stat {_version=0, _creationTime=1599000726843, _modifiedTime=1599000726843, _ephemeralOwner=0} 2020-09-01 22:52:06.865 [HelixTaskExecutor-message_handle_thread] INFO o.apache.pinot.server.starter.helix.HelixInstanceDataManager - Adding segment: terra_lever_events__0__88__20200901T2252Z to table: terra_lever_events_REALTIME 2020-09-01 22:52:06.878 [HelixTaskExecutor-message_handle_thread] INFO o.a.p.c.d.m.r.LLRealtimeSegmentDataManager_terra_lever_events__0__88__20200901T2252Z - RealtimeDataResourceZKMetadata contains no information about sorted column for segment terra_lever_events__0__88__20200901 ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
