[GitHub] [druid] IgorBerman opened a new issue, #13705: Race condition when killing kafka ingestion tasks with replication > 1

via GitHub Mon, 23 Jan 2023 04:56:52 -0800


IgorBerman opened a new issue, #13705:
URL: https://github.com/apache/druid/issues/13705


   There is race condition when kafka ingestion tasks finishing. When first of 
them finishes successfully the second should be stopped. In some cases the 
second one would be killed and task marked as failed, despite the fact the 
first one is finished successfully.
   This might create false positive alerts when e.g. monitoring on failed tasks
   
   ### Affected Version
   
   24.0.1
   
   ### Description
   Hi All
   I think there is some race condition between 2 threads that finishing second 
task in Task group while first task finished successfully
   a bit of context: we using 24.0.1, ingesting from kafka topic with 2 
replicas. Using mysql as metadatastore.
   ```
   druid.coordinator.asOverlord.enabled=true
   druid.indexer.runner.type=remote
   ```
   
   here is some log extraction from coordinator(and overlord) log with my 
comments(the relevant parts will be attached in thread)
   index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_fnoejlpn 
finished successfully ->
   which triggered stopping all tasks in group(another one is  
index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn
   `Stopping all tasks in taskGroup[5] because: [Task 
[index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_fnoejlpn] 
completed successfully, stopping tasks 
[index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn, 
index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_fnoejlpn]]"`
   and it seems that it also finished:
   ```
   2023-01-15T11:36:01,679 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - 
Worker[taz-comp00674.taboolasyndication.com:8091] completed 
task[index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn] 
with status[SUCCESS]
   2023-01-15T11:36:01,679 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - 
Worker[taz-comp00674.taboolasyndication.com:8091] wrote SUCCESS status for task 
[index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn] on 
[TaskLocation{host='taz-comp00674.taboolasyndication.com', port=8102, 
tlsPort=-1}]
   2023-01-15T11:36:01,679 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for 
task: index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn
   ```
   but then:
   `2023-01-15T11:36:01,680 INFO 
[IndexTaskClient-sp_campaigns_realtime_aggregation-3] 
org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - 
Task [index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn] 
failed to stop in a timely manner, killing task`
   and then there is some race between 2 threads(Curator-PathChildrenCache-1 
and IndexTaskClient-sp_campaigns_realtime_aggregation-3 )
   ```
   2023-01-15T11:36:01,684 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task 
index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn to 
status: 
TaskStatus{id=index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn,
 status=SUCCESS, duration=3727754, errorMsg=null}
   2023-01-15T11:36:01,685 INFO 
[IndexTaskClient-sp_campaigns_realtime_aggregation-3] 
org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task 
index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn to 
status: 
TaskStatus{id=index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn,
 status=FAILED, duration=-1, errorMsg=Task 
[index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn] failed 
to stop in a ti...}
   ```
   Looking at logs of the 
index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn  task it 
finished before all those
   when coordinator printed
   2023-01-15T11:34:00,802 INFO 
[IndexTaskClient-sp_campaigns_realtime_aggregation-3] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Task 
[index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn] paused 
successfully
   in it’s log I see
   2023-01-15T11:36:00,754 INFO [main] 
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle 
[module] stage [INIT]
   Finished peon task
   
[index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn.log.txt](https://github.com/apache/druid/files/10479865/index_kafka_sp_campaigns_realtime_aggregation_90e0b11d6585a1a_mhbgoepn.log.txt)
   
[coordinator.log.txt](https://github.com/apache/druid/files/10479868/coordinator.log.txt)
   
   
   looking at code where the ERROR printed “failed to stop in a timely manner, 
killing task” I’ve traced to call of this function
   
[https://github.com/apache/druid/blob/cc89c661d0aebf95bb85f01bfbedf76251e2f307/inde[…]xing/seekablestream/SeekableStreamIndexTaskClientAsyncImpl.java](https://github.com/apache/druid/blob/cc89c661d0aebf95bb85f01bfbedf76251e2f307/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskClientAsyncImpl.java#L144)
   which seems to return false, either when there is http error or when it’s 
notAvailable(not sure what it means)
   anyway I think in case where task already finished, it can’t respond to http 
requests…so it’s not surprising that it’s ‘failed’ to stop it
   I’m thinking there is some missing synchronisation piece since the task 
should have already marked as ‘done’/success when it was paused at first place, 
or alternatively the race between 2 threads above should be solved in some way?
   Any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] IgorBerman opened a new issue, #13705: Race condition when killing kafka ingestion tasks with replication > 1

Reply via email to