zargor opened a new issue, #17849: URL: https://github.com/apache/druid/issues/17849
Running compaction in an MM-less setup with `200` task slots, there is an intermittent exception raised as the result of a failing task: compaction tasks fail with a read timeout against the Overlord, most noticeably during peak traffic.

### Affected Version

`v30.0.0`

### Description

At peak traffic we ingest via `200+` `index_kafka` tasks at a rate of `6-7M` messages per minute.

- Segment granularity: `1H`
- Compaction task slots: `200`
- Middle Manager count: `200+`
- Overlord client connections: `druid.global.http.numConnections=500`
- Coordinator client connections: `druid.global.http.numConnections=200`

Intermittent error with which a compaction task ends up failing:

```
2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - TaskMonitor is initialized with estimatedNumSucceededTasks[245]
2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Starting taskMonitor
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting initial tasks
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting a new task for spec[coordinator-issued_compact_vlf_chhmnlfh_2025-03-27T15:59:59.432Z_0_0]
2025-03-27T16:47:21,454 INFO [ServiceClientFactory-1] org.apache.druid.rpc.ServiceClientImpl - Service [overlord] request [POST http://100.64.141.171:8088/druid/indexer/v1/task] encountered exception on attempt #1; retrying in 100 ms (org.jboss.netty.handler.timeout.ReadTimeoutException: [POST http://100.64.141.171:8088/druid/indexer/v1/task] Read timed out)
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Cleaning up resources
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor
```

### Some debugging

1. Although I am not sure about the cycle of this query, I wonder whether it affects Overlord performance in some way. It runs periodically against the metadata store:

   ```
   SELECT `payload` FROM `druid_segments` WHERE `used` = TRUE
   ```

   avg latency: 51029.51, rows: 1939077.47

2. Since the issue is a timeout from a compaction task to the Overlord, I think I found which timeout is in place. In [RequestBuilder](https://github.com/apache/druid/blob/cee06f0b1ef6d5c8806a94a387c7b5aa5da2b302/server/src/main/java/org/apache/druid/rpc/RequestBuilder.java#L46) there is a `2m` timeout which is not configurable. I suppose the [Overlord proxy](https://github.com/apache/druid/blob/cee06f0/server/src/main/java/org/apache/druid/server/http/OverlordProxyServlet.java) is handling client requests, and if so, there is no possibility to tune its configuration either.

### Further questions and thoughts

1. Should we fork the repo in order to be able to tweak the `RequestBuilder`/`ProxyServlet` configuration?
2. Could you suggest some config options we might consider for Overlord responsiveness? (It has ample resources: `12-16` CPU / `64G` memory.)
3. Going by the [http client](https://druid.apache.org/docs/latest/configuration/#http-client) config options, it seems we cannot really tune the communication between clients (e.g. compaction tasks) and the Overlord proxy. We will nevertheless try increasing `druid.global.http.clientConnectTimeout`, which defaults to `500ms`.
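To illustrate the non-configurable timeout in point 2 of the debugging notes: a minimal, hypothetical sketch of the pattern in question, i.e. a per-request timeout fixed at build time rather than read from configuration. This uses plain `java.net.http` so it is self-contained; Druid's actual `RequestBuilder` wraps its own HTTP stack, and the class, method, and URL below are illustrative, not Druid's code.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class FixedTimeoutSketch {
    // Hypothetical stand-in mirroring the hard-coded 2-minute timeout
    // described above; in a fork this would become a configurable value.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(2);

    static HttpRequest taskSubmit(String overlordBase) {
        return HttpRequest.newBuilder(URI.create(overlordBase + "/druid/indexer/v1/task"))
                .timeout(DEFAULT_TIMEOUT) // fixed at build time, not tunable via properties
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = taskSubmit("http://100.64.141.171:8088");
        // Any Overlord response slower than this window surfaces as a read timeout.
        System.out.println(req.timeout().orElseThrow()); // prints PT2M
    }
}
```

If the Overlord takes longer than that window to accept the task submission (as in the log excerpt above), the attempt fails regardless of how the global HTTP client pool is configured.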
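For reference, a sketch of the global HTTP client settings relevant to question 3, using property names from the linked docs. The values are illustrative for this cluster, the defaults in comments may drift between Druid versions, and, as discussed above, none of these appear to change the hard-coded 2-minute RPC timeout in `RequestBuilder`:

```properties
# Global HTTP client pool (verify names/defaults against the docs for your release).
druid.global.http.numConnections=500            # default 20
druid.global.http.clientConnectTimeout=PT2S     # default PT0.5S (500ms)
druid.global.http.readTimeout=PT15M             # default PT15M
druid.global.http.unusedConnectionTimeout=PT4M  # default PT4M; keep below readTimeout
```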
