wirybeaver commented on issue #1359:
URL: https://github.com/apache/datafusion-ballista/issues/1359#issuecomment-4474497591

   PR #1718 adds **`SplitPartitionsRule`**, the inverse of 
`CoalescePartitionsRule` (#1684).
   
   When upstream stats show that one shuffle partition is far larger than the 
median, the rule fans that partition out across multiple reader tasks by 
assigning the files in its file list round-robin, instead of folding small 
partitions together as `CoalescePartitionsRule` does. It is a strict 
architectural mirror of #1684: the same per-stage invocation, the same 
alignment-group leaf walk, and the same carrier-slot-on-`ExchangeExec` pattern.
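
To make the splitting step concrete, here is a minimal sketch of skew detection against the median followed by round-robin sharding of a file list. It is an illustration under stated assumptions, not the code in #1718: `PartitionStats`, `skewed_partitions`, `shard_round_robin`, and the `skew_factor` threshold are all hypothetical names, and the real rule may pick its threshold and reader count differently.

```rust
/// Hypothetical per-partition summary; not Ballista's actual stats type.
#[derive(Debug, Clone)]
struct PartitionStats {
    partition_id: usize,
    num_bytes: u64,
    files: Vec<String>,
}

/// Median partition size in bytes (upper median for even counts, 0 if empty).
fn median_bytes(parts: &[PartitionStats]) -> u64 {
    let mut sizes: Vec<u64> = parts.iter().map(|p| p.num_bytes).collect();
    sizes.sort_unstable();
    if sizes.is_empty() {
        0
    } else {
        sizes[sizes.len() / 2]
    }
}

/// Ids of partitions larger than `skew_factor` times the median.
/// `skew_factor` is an assumed tuning knob, not a documented setting.
fn skewed_partitions(parts: &[PartitionStats], skew_factor: u64) -> Vec<usize> {
    let threshold = median_bytes(parts).saturating_mul(skew_factor);
    parts
        .iter()
        .filter(|p| p.num_bytes > threshold)
        .map(|p| p.partition_id)
        .collect()
}

/// Distribute `files` round-robin over at most `num_readers` reader tasks.
/// Never creates more shards than there are files, and always at least one.
fn shard_round_robin(files: &[String], num_readers: usize) -> Vec<Vec<String>> {
    let k = num_readers.min(files.len()).max(1);
    let mut shards: Vec<Vec<String>> = vec![Vec::new(); k];
    for (i, file) in files.iter().enumerate() {
        shards[i % k].push(file.clone());
    }
    shards
}
```

Round-robin over files balances file counts rather than bytes, so shards can still be uneven when file sizes vary widely.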
   
   **Scope limitation in v1.** File-list sharding produces 
`UnknownPartitioning(K')` output, so the rule bails on any stage whose subtree 
contains a node requiring `HashPartitioned` or `SinglePartition` input (joins, 
`FinalPartitioned` aggregates). v1 therefore helps only stages whose consumer 
is distribution-agnostic (`Filter`/`Projection`/`LocalLimit` over a hash 
exchange). The TPC-H Q2 SF1000 skew that originally motivated this work (#1643) 
sits behind a `FinalPartitioned` aggregate, so v1 does not address it; v2 
(row-range reads plus aggregate-aware plan rewriting) is the path that resolves 
#1643. The task doc is cross-linked from #1718.
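
The bail-out condition described above can be sketched as a recursive walk over a stage's subtree. This is a simplified model for illustration only: `RequiredInput`, `PlanNode`, and `can_split` are hypothetical stand-ins loosely modelled on DataFusion's required-input-distribution concept, and the actual check in #1718 operates on real `ExecutionPlan` nodes.

```rust
/// Hypothetical stand-in for a node's required input distribution.
#[derive(Debug, Clone, PartialEq)]
enum RequiredInput {
    Unspecified,
    HashPartitioned,
    SinglePartition,
}

/// Hypothetical, simplified plan node; not DataFusion's `ExecutionPlan`.
#[derive(Debug)]
struct PlanNode {
    name: &'static str,
    required_input: RequiredInput,
    children: Vec<PlanNode>,
}

/// True only when no node in the subtree requires `HashPartitioned` or
/// `SinglePartition` input, i.e. every consumer tolerates the
/// `UnknownPartitioning` output that file-list sharding produces.
fn can_split(node: &PlanNode) -> bool {
    node.required_input == RequiredInput::Unspecified
        && node.children.iter().all(can_split)
}
```

Under this model, a `Filter` over an exchange passes the check, while a `FinalPartitioned` aggregate anywhere in the subtree makes the rule skip the whole stage.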
   
   #1718 is stacked on #1684; once #1684 lands and #1718 is rebased, its diff 
reduces to the single feat commit.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

