Re: [PR] Parallelize storage of incremental segments (druid)

via GitHub Tue, 18 Jul 2023 09:14:04 -0700


ektravel commented on code in PR #13982:
URL: https://github.com/apache/druid/pull/13982#discussion_r1267006198



##########
docs/development/extensions-core/kafka-supervisor-reference.md:
##########
@@ -198,6 +198,7 @@ The `tuningConfig` is optional and default parameters will be used if no `tuning
 | `maxTotalRows` | Long | The number of rows to aggregate across all segments; this number is post-aggregation rows. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier. | no (default == unlimited) |
 | `intermediatePersistPeriod` | ISO8601 Period | The period that determines the rate at which intermediate persists occur. | no (default == PT10M) |
 | `maxPendingPersists` | Integer | Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with `maxRowsInMemory` * (2 + `maxPendingPersists`). | no (default == 0, meaning one persist can be running concurrently with ingestion, and none can be queued up) |
+|  numPersistThreads | Integer | The number of threads to use to create and persist incremental segments on the disk. Higher ingestion data throughput results in larger number of incremental segments, causing significant cpu time to be spent on the creation of the incremental segments on the disk. For datasources with number of columns running into hundreds or thousands, creation of the incremental segments may take up significant time, in the order of multiple seconds. Both these scenarios can cause ingestion can pause frequently or stall causing it to fall behind. With more threads the segment creation can be parallelized without blocking ingestion as long as there are sufficient cpu resources available. | no (default == 1) |

Review Comment:
   ```suggestion
   |  `numPersistThreads` | Integer | The number of threads to use to create and persist incremental segments on the disk. Higher ingestion data throughput results in a larger number of incremental segments, causing significant CPU time to be spent on the creation of the incremental segments on the disk. For datasources with hundreds or thousands of columns, creation of incremental segments may take up significant time, on the order of multiple seconds. In both of these scenarios, ingestion can stall or pause frequently, causing it to fall behind. You can use additional threads to parallelize the segment creation without blocking ingestion as long as there are sufficient CPU resources available. | no (default == 1) |
   ```
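
   For readers of the table row, a minimal sketch of how the new property might sit alongside the existing persist settings in a Kafka supervisor `tuningConfig`. The property names and defaults come from the table in this diff; the helper `build_tuning_config`, the value `2`, and the lower-bound check are illustrative assumptions, not Druid behavior.

   ```python
   import json


   def build_tuning_config(num_persist_threads=1):
       """Return a tuningConfig fragment as a plain dict.

       Hypothetical helper for illustration only. The check below is a
       local sanity check in this sketch, not validation that Druid is
       known to perform.
       """
       if num_persist_threads < 1:
           raise ValueError("num_persist_threads must be at least 1")
       return {
           "type": "kafka",
           # Defaults as documented in the table rows of this diff.
           "intermediatePersistPeriod": "PT10M",
           "maxPendingPersists": 0,
           # New property from this PR; documented default is 1.
           "numPersistThreads": num_persist_threads,
       }


   if __name__ == "__main__":
       # Example value only; choose it based on available CPU resources.
       print(json.dumps(build_tuning_config(2), indent=2))
   ```

   Leaving the property out of the spec should be equivalent to the documented default of 1.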



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


