[GitHub] [druid] gianm opened a new pull request, #13506: Sort-merge join and hash shuffles for MSQ.

GitBox Tue, 06 Dec 2022 00:06:01 -0800


gianm opened a new pull request, #13506:
URL: https://github.com/apache/druid/pull/13506


   The main changes are in the processing, multi-stage-query, and sql modules.
   
   processing module:
   
   1) Rename SortColumn to KeyColumn, replace boolean descending with KeyOrder.
      This makes it nicer to model hash keys, which use KeyOrder.NONE.
   
   2) Add nullability checkers to the FieldReader interface, and an
      "isPartiallyNullKey" method to FrameComparisonWidget. The join
      processor uses this to detect null keys.
   
   3) Add WritableFrameChannel.isClosed and OutputChannel.isReadableChannelReady
      so callers can tell which OutputChannels are ready for reading and which
      aren't.
   
   4) Specialize FrameProcessors.makeCursor to return FrameCursor, a
      random-access implementation. The join processor uses this to rewind
      when it needs to replay a set of rows with a particular key.
   
   5) Add MemoryAllocatorFactory, which is embedded inside FrameWriterFactory
      instead of a particular MemoryAllocator. This allows FrameWriterFactory
      to be shared in more scenarios.
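
   As a rough illustration of items 1 and 2, the sketch below models a key
   column whose order may be "none" (for hash keys) and a check for keys that
   contain a null field. Only the names KeyColumn, KeyOrder, KeyOrder.NONE,
   and isPartiallyNullKey come from this description. The class KeySketch,
   the ASCENDING and DESCENDING constants, the isSortable helper, and all
   signatures are assumptions made for the example, not the code in this PR.

```java
import java.util.List;

// Hypothetical stand-ins for illustration only; not the actual Druid classes.
final class KeySketch {
  // NONE is named in the PR description; ASCENDING and DESCENDING are assumed here.
  enum KeyOrder { NONE, ASCENDING, DESCENDING }

  static final class KeyColumn {
    final String columnName;
    final KeyOrder order;

    KeyColumn(String columnName, KeyOrder order) {
      this.columnName = columnName;
      this.order = order;
    }
  }

  // A key can drive a sort only if every column carries a sort direction.
  // A hash key, whose columns all use KeyOrder.NONE, returns false here.
  static boolean isSortable(List<KeyColumn> key) {
    for (KeyColumn column : key) {
      if (column.order == KeyOrder.NONE) {
        return false;
      }
    }
    return true;
  }

  // True if at least one field of the key is null. A join processor can use
  // a check like this to treat such rows as having no match.
  static boolean isPartiallyNullKey(Object[] keyValues) {
    for (Object value : keyValues) {
      if (value == null) {
        return true;
      }
    }
    return false;
  }
}
```

   Using one type for both cases is what lets a single key definition
   describe either a sort or a hash partitioning.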
   
   multi-stage-query module:
   
   1) ShuffleSpec: Add hash-based shuffles. New enum ShuffleKind helps callers
      figure out what kind of shuffle is happening. The change from SortColumn
      to KeyColumn allows ClusterBy to be used for both hash-based and
      sort-based shuffling.
   
   2) WorkerImpl: Add ability to handle hash-based shuffles. Refactor the logic
      to be more readable by moving the work-order-running code to the inner
      class RunWorkOrder, and the shuffle-pipeline-building code to the inner
      class ShufflePipelineBuilder.
   
   3) Add SortMergeJoinFrameProcessor and factory.
   
   4) WorkerMemoryParameters: Adjust logic to reserve space for output frames
      for hash partitioning. (We need one frame per partition.)
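
   For readers unfamiliar with the technique, here is a minimal sketch of an
   inner sort-merge join over two in-memory lists that are already sorted by
   key. It shows the rewind step (replaying the matching run on the right
   side for each left row with the same key) and the skipping of null keys.
   The class SortMergeJoinSketch, the row layout (a two-element String
   array of key and value), and the output format are assumptions made for
   the example. The actual SortMergeJoinFrameProcessor works on frames and
   channels and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain in-memory inner sort-merge join.
final class SortMergeJoinSketch {
  // Each row is {key, value}. Both inputs must be sorted ascending by key,
  // with any null keys grouped at the start. Rows with null keys never match.
  static List<String> join(List<String[]> left, List<String[]> right) {
    List<String> out = new ArrayList<>();
    int l = 0;
    int r = 0;
    while (l < left.size() && r < right.size()) {
      String leftKey = left.get(l)[0];
      String rightKey = right.get(r)[0];
      if (leftKey == null) {
        l++; // null keys are skipped on the left
        continue;
      }
      if (rightKey == null) {
        r++; // and on the right
        continue;
      }
      int cmp = leftKey.compareTo(rightKey);
      if (cmp < 0) {
        l++;
      } else if (cmp > 0) {
        r++;
      } else {
        int mark = r; // start of the right-side run for this key
        while (l < left.size() && leftKey.equals(left.get(l)[0])) {
          r = mark; // rewind: replay the right-side run for each left row
          while (r < right.size() && leftKey.equals(right.get(r)[0])) {
            out.add(left.get(l)[1] + "," + right.get(r)[1]);
            r++;
          }
          l++;
        }
      }
    }
    return out;
  }
}
```

   The rewind is why random access matters: a forward-only cursor could not
   return to the start of the run when the next left row has the same key.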
   
   sql module:
   
   1) Add sqlJoinAlgorithm context parameter; can be "broadcast" or
      "sortMerge". With native, it must always be "broadcast", or it's a
      validation error. MSQ supports both. Default is "broadcast" in
      both engines.
   
   2) Validate that MSQs do not use broadcast join with RIGHT or FULL join,
      as results are not correct for broadcast join with those types. Allow
      this in native for two reasons: legacy (the docs caution against it,
      but it's always been allowed), and the fact that it actually *does*
      generate correct results in native when the join is processed on the
      Broker. It is much less likely that MSQ will plan in such a way that
      generates correct results.
   
   3) Remove subquery penalty in DruidJoinQueryRel when using sort-merge
      join, because subqueries are always required, so there's no reason
      to penalize them.
   
   4) Move previously-disabled join reordering and manipulation rules to
      FANCY_JOIN_RULES, and enable them when using sort-merge join. Helps
      get to better plans where projections and filters are pushed down.
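
   The validation rules in items 1 and 2 can be summarized with the sketch
   below. The parameter values "broadcast" and "sortMerge" and the default
   come from this description. Everything else (the class
   JoinAlgorithmValidationSketch, its enums, the parse and isValid methods,
   and exact-match parsing of the context value) is hypothetical and is not
   the planner code in this PR.

```java
// Illustrative only: the rules described above, expressed as a pure function.
final class JoinAlgorithmValidationSketch {
  enum Engine { NATIVE, MSQ }
  enum JoinAlgorithm { BROADCAST, SORT_MERGE }
  enum JoinType { INNER, LEFT, RIGHT, FULL }

  // Maps the sqlJoinAlgorithm context value to an algorithm.
  // A missing value falls back to the default, "broadcast".
  static JoinAlgorithm parse(String contextValue) {
    if (contextValue == null) {
      return JoinAlgorithm.BROADCAST;
    }
    switch (contextValue) {
      case "broadcast":
        return JoinAlgorithm.BROADCAST;
      case "sortMerge":
        return JoinAlgorithm.SORT_MERGE;
      default:
        throw new IllegalArgumentException("Unknown sqlJoinAlgorithm: " + contextValue);
    }
  }

  static boolean isValid(Engine engine, JoinAlgorithm algorithm, JoinType joinType) {
    if (engine == Engine.NATIVE) {
      // Native supports only broadcast, and allows it for every join type.
      return algorithm == JoinAlgorithm.BROADCAST;
    }
    // MSQ supports both algorithms, but rejects broadcast for RIGHT and FULL joins.
    boolean rightOrFull = joinType == JoinType.RIGHT || joinType == JoinType.FULL;
    return !(algorithm == JoinAlgorithm.BROADCAST && rightOrFull);
  }
}
```

   Under these rules, a RIGHT or FULL join in MSQ has to set
   sqlJoinAlgorithm to "sortMerge", since the default would be rejected.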


