yukkit commented on issue #8241:
URL:
https://github.com/apache/arrow-datafusion/issues/8241#issuecomment-1815836734
> 1. Add a new `HashPartitionedBroadcastNull` distribution method:
>
> ```rust
> /// How data is distributed amongst partitions. See [`Partitioning`] for
more
> /// details.
> #[derive(Debug, Clone)]
> pub enum Distribution {
> /// Unspecified distribution
> UnspecifiedDistribution,
> /// A single partition is required
> SinglePartition,
> /// Requires children to be distributed in such a way that the same
> /// values of the keys end up in the same partition
> HashPartitioned(Vec<Arc<dyn PhysicalExpr>>),
> /// Requires children to be distributed in such a way that the same
> /// values of the keys end up in the same partition, and the first row
having a key with nulls
> /// is broadcast to all partitions (typically used for the inner
relation of antijoin)
> HashPartitionedBroadcastNull(Vec<Arc<dyn PhysicalExpr>>),
> }
> ```
Perhaps no changes are needed in the Distribution.
>
> Is this better than modifying `HashPartitioned` with a new flag, e.g.
`HashPartitioned(Vec<Arc<dyn PhysicalExpr>>, bool)`, to control whether nulls
are distributed?
>
> 2. Add a new flag to `Partitioning::Hash`:
>
> ```rust
> pub enum Partitioning {
> /// Allocate batches using a round-robin algorithm and the specified
number of partitions
> RoundRobinBatch(usize),
> - /// Allocate rows based on a hash of one of more expressions and the
specified number of
> - /// partitions
> - Hash(Vec<Arc<dyn PhysicalExpr>>, usize),
> + /// Allocate rows based on a hash of one of more expressions, the
specified number of
> + /// partitions, and whether the first null in the input should be
broadcast to all partitions
> + Hash(Vec<Arc<dyn PhysicalExpr>>, usize, bool),
> /// Unknown partitioning scheme with a known number of partitions
> UnknownPartitioning(usize),
> }
> ```
>
> Same question here. Does it make more sense to add a flag to `Hash`
partitioning, or make a totally separate enum value?
>
I'm inclined towards adding a flag to `Hash` partitioning because broadcast
null value only occurs in hash partitioning, even if the modification of
partitioning will have a slight impact on downstream.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]