gianm commented on code in PR #13503:
URL: https://github.com/apache/druid/pull/13503#discussion_r1040495755


##########
docs/ingestion/tasks.md:
##########
@@ -343,6 +343,27 @@ You can override the task priority by setting your priority in the task context
   "priority" : 100
 }
 ```
+<a name="actions"></a>
+
+## Task actions
+
+These are various overlord actions performed by tasks during their lifecycle. Some typical actions are as follows:
+- `lockAcquire`: acquires a time-chunk lock on an interval for the task 
+- `lockRelease`: releases a lock acquired by the task on an interval
+- `segmentInsertion`: inserts segments into metadata store

Review Comment:
   Worth mentioning `segmentTransactionalInsert` too?



##########
docs/ingestion/tasks.md:
##########
@@ -343,6 +343,27 @@ You can override the task priority by setting your priority in the task context
   "priority" : 100
 }
 ```
+<a name="actions"></a>
+
+## Task actions
+
+These are various overlord actions performed by tasks during their lifecycle. Some typical actions are as follows:
+- `lockAcquire`: acquires a time-chunk lock on an interval for the task 
+- `lockRelease`: releases a lock acquired by the task on an interval
+- `segmentInsertion`: inserts segments into metadata store
+- `segmentAllocate`: allocates pending segments to a task to write rows
+- etc.
+
+### Batching `segmentAllocate` actions
+
+In a cluster with a large number of concurrent tasks (say > 1000), `segmentAllocate` actions on the overlord may take very long intervals of time to finish thus causing spikes in the `task/action/run/time`. This may result in lag building up while a task waits for a segment to get allocated.

Review Comment:
   I think this is going to be useful even at much lower concurrent task counts. I have seen people have issues with allocation timings around task rollover even with just 10s of tasks, requiring a scale-up of metadata store size.
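
   For operators hitting this, enabling the feature is a small config change. A hedged sketch of the relevant `overlord/runtime.properties` entries, using the property names and defaults from the config table in this PR (values here are illustrative):

   ```properties
   # Opt in to batched segment allocation (defaults to false in this PR)
   druid.indexer.tasklock.batchSegmentAllocation=true
   # How long to wait for more requests to join a batch before executing it (ms)
   druid.indexer.tasklock.batchAllocationWaitTime=500
   ```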



##########
docs/configuration/index.md:
##########
@@ -1112,6 +1112,8 @@ These Overlord static configurations can be defined in the `overlord/runtime.pro
 |`druid.indexer.storage.type`|Choices are "local" or "metadata". Indicates whether incoming tasks should be stored locally (in heap) or in metadata storage. "local" is mainly for internal testing while "metadata" is recommended in production because storing incoming tasks in metadata storage allows for tasks to be resumed if the Overlord should fail.|local|
 |`druid.indexer.storage.recentlyFinishedThreshold`|Duration of time to store task results. Default is 24 hours. If you have hundreds of tasks running in a day, consider increasing this threshold.|PT24H|
 |`druid.indexer.tasklock.forceTimeChunkLock`|_**Setting this to false is still experimental**_<br/> If set, all tasks are enforced to use time chunk lock. If not set, each task automatically chooses a lock type to use. This configuration can be overwritten by setting `forceTimeChunkLock` in the [task context](../ingestion/tasks.md#context). See [Task Locking & Priority](../ingestion/tasks.md#context) for more details about locking in tasks.|true|
+|`druid.indexer.tasklock.batchSegmentAllocation`|If set to true, segment allocate actions are performed in batches to improve the throughput and reduce the average `task/action/run/time`.|false|
+|`druid.indexer.tasklock.batchAllocationWaitTime`|Milliseconds to wait between adding the first segment allocate action to a batch and executing that batch. The waiting time allows the batch to add more requests and thus improve the average segment allocation run time. This configuration takes effect only if `batchSegmentAllocation` is enabled. <br> This value should be decreased __only if__ there are failures while allocating segments due to metadata operations on a very large batch.|500|

Review Comment:
   500 milliseconds already seems pretty short. If people are getting failures due to too-large batches, I wonder if reducing this will reliably fix their problem. Maybe it's better to suggest they turn off batch segment allocation completely.
   
   Ideally, the batching would be robust to this case, and avoid generating oversized metadata operations. Is that possible?
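
   One way to make the batching robust would be a size cap on each batch, so a single metadata transaction can never grow past it while the wait-time timer flushes whatever remains. A minimal sketch of that idea, with hypothetical names that are not Druid's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical size-capped batch of segment-allocate requests.
// Illustrates the suggestion only; not Druid's actual implementation.
class SegmentAllocateBatch {
    private final int maxBatchSize;
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();

    SegmentAllocateBatch(int maxBatchSize) {
        this.maxBatchSize = maxBatchSize;
    }

    // Add a request; flush eagerly once the cap is reached so one
    // metadata operation never covers more than maxBatchSize requests.
    void add(String request) {
        pending.add(request);
        if (pending.size() >= maxBatchSize) {
            flush();
        }
    }

    // Also invoked when batchAllocationWaitTime elapses (timer not modeled here).
    void flush() {
        if (!pending.isEmpty()) {
            flushed.add(new ArrayList<>(pending));
            pending.clear();
        }
    }

    List<List<String>> getFlushed() {
        return flushed;
    }
}
```

   With a cap of 3, seven queued requests would go to the metadata store as three operations of sizes 3, 3, and 1, instead of one oversized operation of 7.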



##########
docs/operations/metrics.md:
##########
@@ -230,8 +230,14 @@ Note: If the JVM does not support CPU time measurement for the current thread, `
 
|------|-----------|------------------------------------------------------------|------------|
 |`task/run/time`|Milliseconds taken to run a task.| `dataSource`, `taskId`, `taskType`, `taskStatus`|Varies|
 |`task/pending/time`|Milliseconds taken for a task to wait for running.| `dataSource`, `taskId`, `taskType`|Varies|
-|`task/action/log/time`|Milliseconds taken to log a task action to the audit log.| `dataSource`, `taskId`, `taskType`|< 1000 (subsecond)|
-|`task/action/run/time`|Milliseconds taken to execute a task action.| `dataSource`, `taskId`, `taskType`|Varies from subsecond to a few seconds, based on action type.|
+|`task/action/log/time`|Milliseconds taken to log a task action to the audit log.| `dataSource`, `taskId`, `taskType`, `taskActionType`|< 1000 (subsecond)|
+|`task/action/run/time`|Milliseconds taken to execute a task action.| `dataSource`, `taskId`, `taskType`, `taskActionType`|Varies from subsecond to a few seconds, based on action type.|
+|`task/action/success/count`|Number of task actions that were executed successfully during the emission period. Currently only being emitted for batched `segmentAllocate` actions.| `dataSource`, `taskId`, `taskType`, `taskActionType`|Varies|
+|`task/action/failed/count`|Number of task actions that failed during the emission period. Currently only being emitted for batched `segmentAllocate` actions.| `dataSource`, `taskId`, `taskType`, `taskActionType`|Varies|

Review Comment:
   Would be good to link all mentions of "batched segmentAllocate actions" to `../ingestion/tasks.md#batching-segmentallocate-actions`.
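
   For example, each mention could be written as:

   ```markdown
   [batched `segmentAllocate` actions](../ingestion/tasks.md#batching-segmentallocate-actions)
   ```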



##########
docs/configuration/index.md:
##########
@@ -1112,6 +1112,8 @@ These Overlord static configurations can be defined in the `overlord/runtime.pro
 |`druid.indexer.storage.type`|Choices are "local" or "metadata". Indicates whether incoming tasks should be stored locally (in heap) or in metadata storage. "local" is mainly for internal testing while "metadata" is recommended in production because storing incoming tasks in metadata storage allows for tasks to be resumed if the Overlord should fail.|local|
 |`druid.indexer.storage.recentlyFinishedThreshold`|Duration of time to store task results. Default is 24 hours. If you have hundreds of tasks running in a day, consider increasing this threshold.|PT24H|
 |`druid.indexer.tasklock.forceTimeChunkLock`|_**Setting this to false is still experimental**_<br/> If set, all tasks are enforced to use time chunk lock. If not set, each task automatically chooses a lock type to use. This configuration can be overwritten by setting `forceTimeChunkLock` in the [task context](../ingestion/tasks.md#context). See [Task Locking & Priority](../ingestion/tasks.md#context) for more details about locking in tasks.|true|
+|`druid.indexer.tasklock.batchSegmentAllocation`|If set to true, segment allocate actions are performed in batches to improve the throughput and reduce the average `task/action/run/time`.|false|

Review Comment:
   Would be good to link to `../ingestion/tasks.md#batching-segmentallocate-actions`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

