gianm opened a new pull request, #17057:
URL: https://github.com/apache/druid/pull/17057

   This patch reworks memory management to better support multi-threaded 
workers running in shared JVMs. There are two main changes.
   
   First, processing buffers and threads are moved from a per-JVM model to a 
per-worker model. This enables queries to hold processing buffers without 
blocking other concurrently-running queries. Changes:
   
   - Introduce ProcessingBuffersSet and ProcessingBuffers to hold the 
per-worker and per-work-order processing buffers (respectively). On Peons, this 
is the JVM-wide processing pool. On Indexers, this is a per-worker pool of 
on-heap buffers. (This change fixes a bug on Indexers where excessive 
processing buffers could be used if MSQ tasks ran concurrently with realtime 
tasks.)
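
   The per-worker pool idea can be sketched roughly as follows. This is an illustrative sketch only, not the actual ProcessingBuffersSet/ProcessingBuffers code; the class name, the blocking-queue implementation, and the buffer sizes are all assumptions:

   ```java
   import java.nio.ByteBuffer;
   import java.util.concurrent.ArrayBlockingQueue;
   import java.util.concurrent.BlockingQueue;

   // Hypothetical per-worker pool: each worker owns its own set of buffers,
   // so one query holding buffers does not block another worker's queries.
   class BufferPoolSketch
   {
     private final BlockingQueue<ByteBuffer> buffers;

     BufferPoolSketch(int numBuffers, int bufferBytes)
     {
       this.buffers = new ArrayBlockingQueue<>(numBuffers);
       for (int i = 0; i < numBuffers; i++) {
         // On Indexers these would be on-heap buffers, per the patch description.
         buffers.add(ByteBuffer.allocate(bufferBytes));
       }
     }

     // Acquire a buffer for a work order; blocks only within this worker's pool.
     ByteBuffer acquire()
     {
       try {
         return buffers.take();
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         throw new RuntimeException(e);
       }
     }

     // Release the buffer when the work order is done with it.
     void release(ByteBuffer buffer)
     {
       buffer.clear();
       buffers.add(buffer);
     }

     int available()
     {
       return buffers.size();
     }
   }
   ```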
   
   - Add "bufferPool" argument to GroupingEngine#process so a per-worker pool 
can be passed in.
   
   - Add "druid.msq.task.memory.maxThreads" property, which controls the 
maximum number of processing threads to use per task. This allows usage of 
multiple processing buffers per task if admins desire.
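
   For example, an admin who wants each task to use up to four processing threads (and hence up to four processing buffers) could set something like the following. The property name is from this patch; the value is only an example:

   ```properties
   # Illustrative value; default behavior is described in the patch.
   druid.msq.task.memory.maxThreads=4
   ```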
   
   - IndexerWorkerContext acquires processingBuffers when creating the 
FrameContext for a work order, and releases them when closing the FrameContext.
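
   The acquire-on-create, release-on-close lifecycle can be sketched like this. The names below are illustrative stand-ins for IndexerWorkerContext and FrameContext, not the real classes:

   ```java
   import java.nio.ByteBuffer;

   // Hypothetical sketch: buffers are acquired when the frame context for a
   // work order is created, and released when the context closes.
   class FrameContextSketch implements AutoCloseable
   {
     interface Pool
     {
       ByteBuffer acquire();

       void release(ByteBuffer buffer);
     }

     private final Pool pool;
     private final ByteBuffer processingBuffer;

     FrameContextSketch(Pool pool)
     {
       this.pool = pool;
       this.processingBuffer = pool.acquire(); // acquired on creation
     }

     @Override
     public void close()
     {
       pool.release(processingBuffer); // released on close
     }
   }
   ```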
   
   - Add "usesProcessingBuffers()" to FrameProcessorFactory so workers know how 
many sets of processing buffers are needed to run a given query.
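
   Conceptually, this lets a worker size its pool before running a stage by counting which factories need buffers. A minimal sketch, with hypothetical names in place of FrameProcessorFactory:

   ```java
   import java.util.List;

   // Hypothetical sketch: each processor factory reports whether it needs
   // processing buffers, so a worker can count the buffer sets required.
   interface ProcessorFactorySketch
   {
     boolean usesProcessingBuffers();
   }

   class WorkerSketch
   {
     // Count how many buffer sets the given factories would need.
     static int buffersNeeded(List<ProcessorFactorySketch> factories)
     {
       int n = 0;
       for (ProcessorFactorySketch f : factories) {
         if (f.usesProcessingBuffers()) {
           n++;
         }
       }
       return n;
     }
   }
   ```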
   
   Second, WorkerMemoryParameters now slices up bundles differently, to favor 
more memory for sorting and segment generation. Changes:
   
   - Instead of using same-sized bundles for processing and for sorting, 
workers now use minimally-sized processing bundles (just enough to read inputs 
plus a little overhead). The rest is devoted to broadcast data buffering, 
sorting, and segment-building.
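
   The new split amounts to simple subtraction: reserve a minimal fixed amount per processing bundle, and hand everything else to sorting, broadcast buffering, and segment building. A hypothetical illustration (the numbers and method names are not from WorkerMemoryParameters):

   ```java
   // Hypothetical arithmetic for the new slicing: processing bundles get a
   // minimal share, and the remainder goes to sorting, broadcast data
   // buffering, and segment building.
   class MemorySliceSketch
   {
     static long memoryForSortAndBuild(long totalBytes, long minProcessingBytes, int numProcessingBundles)
     {
       long processing = (long) numProcessingBundles * minProcessingBytes;
       if (processing >= totalBytes) {
         throw new IllegalArgumentException("not enough memory for processing bundles");
       }
       return totalBytes - processing;
     }
   }
   ```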
   
   - Segment-building is now limited to 1 concurrent segment per work order. 
This allows each segment-building action to use more memory. Note that 
segment-building is internally multi-threaded to a degree. (Build and persist 
can run concurrently.)
   
   - Simplify frame size calculations by removing the distinction between 
"standard" and "large" frames. The new default frame size is the same as the 
old "standard" frames, 1 MB. The original goal of the large frames was to 
reduce the number of temporary files during sorting, but I think we can achieve 
the same thing by simply merging a larger number of standard frames at once.
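
   To illustrate why larger fan-in substitutes for larger frames: with N sorted frames and a merge fan-in of k, an external sort needs about ceil(log_k(N)) merge passes, so raising k cuts temporary files as effectively as raising the frame size would. A small sketch of that arithmetic (not actual super-sorter code):

   ```java
   // Hypothetical illustration: number of merge passes when merging up to
   // fanIn sorted runs at a time. Larger fanIn means fewer passes and fewer
   // temporary files, without needing larger frames.
   class MergePassSketch
   {
     static int mergePasses(int numFrames, int fanIn)
     {
       int passes = 0;
       int runs = numFrames;
       while (runs > 1) {
         runs = (runs + fanIn - 1) / fanIn; // merge up to fanIn runs at once
         passes++;
       }
       return passes;
     }
   }
   ```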
   
   - Remove the small worker adjustment that was added in #14117 to account for 
an extra frame involved in writing to durable storage. Instead, account for the 
extra frame whenever we are actually using durable storage.
   
   - Cap super-sorter parallelism using the number of output partitions, rather 
than a hard-coded cap of 4. Note that in practice, so far, this cap has 
not been relevant for tasks because they have only been using a single 
processing thread anyway.
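
   The cap change itself is a one-line idea, sketched here with hypothetical names (the old hard-coded 4 is shown in the comment):

   ```java
   // Hypothetical sketch of the new cap: super-sorter parallelism is limited
   // by the number of output partitions rather than a hard-coded 4.
   class SorterParallelismSketch
   {
     static int parallelism(int processingThreads, int outputPartitions)
     {
       // Old behavior (illustrative): Math.min(processingThreads, 4)
       return Math.min(processingThreads, outputPartitions);
     }
   }
   ```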


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

