zargor opened a new issue, #17849: URL: https://github.com/apache/druid/issues/17849
Running compaction in an MM-less setup with `200` task slots, there is an intermittent exception raised as the result of a failing task: compaction tasks fail with a read timeout against the Overlord, most noticeably during peak traffic.

### Affected Version

`v30.0.0`

### Description

At peak traffic we ingest via `200+` `index_kafka` tasks at a rate of `6-7M` messages per minute.

- Segment granularity: `1H`
- Compaction task slots: `200`
- Middle Manager count: `200+`
- Overlord client connections: `druid.global.http.numConnections=500`
- Coordinator client connections: `druid.global.http.numConnections=200`

Intermittent error with which a compaction task ends up failing:

```
2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - TaskMonitor is initialized with estimatedNumSucceededTasks[245]
2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Starting taskMonitor
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting initial tasks
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting a new task for spec[coordinator-issued_compact_vlf_chhmnlfh_2025-03-27T15:59:59.432Z_0_0]
2025-03-27T16:47:21,454 INFO [ServiceClientFactory-1] org.apache.druid.rpc.ServiceClientImpl - Service [overlord] request [POST http://100.64.141.171:8088/druid/indexer/v1/task] encountered exception on attempt #1; retrying in 100 ms (org.jboss.netty.handler.timeout.ReadTimeoutException: [POST http://100.64.141.171:8088/druid/indexer/v1/task] Read timed out)
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Cleaning up resources
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor
```

### Some debugging

1. Although I am not sure about the cycle of this query, I wonder whether it affects Overlord performance in some way. It runs periodically against the metadata store:

   ```
   SELECT `payload` FROM `druid_segments` WHERE `used` = TRUE
   ```

   avg latency: 51029.51, rows: 1939077.47

2. Since the issue is a timeout from a compaction task to the Overlord, I think I found which timeout is in place. In [RequestBuilder](https://github.com/apache/druid/blob/cee06f0b1ef6d5c8806a94a387c7b5aa5da2b302/server/src/main/java/org/apache/druid/rpc/RequestBuilder.java#L46) there is a `2m` timeout which is not configurable. I suppose the [Overlord proxy](https://github.com/apache/druid/blob/cee06f0/server/src/main/java/org/apache/druid/server/http/OverlordProxyServlet.java) is handling client requests, and if so, there is no possibility to tune its configuration either.

### Further questions and thoughts

1. Should we fork the repo in order to be able to tweak the `RequestBuilder`/`ProxyServlet` configuration?
2. Could you suggest some config options we might consider for Overlord responsiveness? (It has ample resources: `12-16` CPU / `64G` memory.)
3. Going by the [http client](https://druid.apache.org/docs/latest/configuration/#http-client) config options, it seems we cannot really tune the communication between clients (e.g. compaction tasks) and the Overlord proxy. We will nevertheless try increasing `druid.global.http.clientConnectTimeout`, which defaults to `500ms`.
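To illustrate the non-configurable timeout in point 2 of the debugging notes: a minimal, hypothetical sketch of the pattern in question, i.e. a per-request timeout fixed at build time rather than read from configuration. This uses plain `java.net.http` so it is self-contained; Druid's actual `RequestBuilder` wraps its own HTTP stack, and the class, method, and URL below are illustrative, not Druid's code.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class FixedTimeoutSketch {
    // Hypothetical stand-in mirroring the hard-coded 2-minute timeout
    // described above; in a fork this would become a configurable value.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(2);

    static HttpRequest taskSubmit(String overlordBase) {
        return HttpRequest.newBuilder(URI.create(overlordBase + "/druid/indexer/v1/task"))
                .timeout(DEFAULT_TIMEOUT) // fixed at build time, not tunable via properties
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = taskSubmit("http://100.64.141.171:8088");
        // Any Overlord response slower than this window surfaces as a read timeout.
        System.out.println(req.timeout().orElseThrow()); // prints PT2M
    }
}
```

If the Overlord takes longer than that window to accept the task submission (as in the log excerpt above), the attempt fails regardless of how the global HTTP client pool is configured.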
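For reference, a sketch of the global HTTP client settings relevant to question 3, using property names from the linked docs. The values are illustrative for this cluster, the defaults in comments may drift between Druid versions, and, as discussed above, none of these appear to change the hard-coded 2-minute RPC timeout in `RequestBuilder`:

```properties
# Global HTTP client pool (verify names/defaults against the docs for your release).
druid.global.http.numConnections=500            # default 20
druid.global.http.clientConnectTimeout=PT2S     # default PT0.5S (500ms)
druid.global.http.readTimeout=PT15M             # default PT15M
druid.global.http.unusedConnectionTimeout=PT4M  # default PT4M; keep below readTimeout
```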
