yukkit commented on issue #8241:
URL: 
https://github.com/apache/arrow-datafusion/issues/8241#issuecomment-1815836734

   > 1. Add a new `HashPartitionedBroadcastNull` distribution method:
   > 
   > ```rust
   > /// How data is distributed amongst partitions. See [`Partitioning`] for 
more
   > /// details.
   > #[derive(Debug, Clone)]
   > pub enum Distribution {
   >     /// Unspecified distribution
   >     UnspecifiedDistribution,
   >     /// A single partition is required
   >     SinglePartition,
   >     /// Requires children to be distributed in such a way that the same
   >     /// values of the keys end up in the same partition
   >     HashPartitioned(Vec<Arc<dyn PhysicalExpr>>),
   >     /// Requires children to be distributed in such a way that the same
   >     /// values of the keys end up in the same partition, and the first row 
having a key with nulls
   >     /// is broadcast to all partitions (typically used for the inner 
relation of antijoin)
   >     HashPartitionedBroadcastNull(Vec<Arc<dyn PhysicalExpr>>),
   > }
   > ```
   
   Perhaps no changes are needed in the Distribution.
   
   > 
   > Is this better than modifying `HashPartitioned` with a new flag, e.g. 
`HashPartitioned(Vec<Arc<dyn PhysicalExpr>>, bool)`, to control whether nulls 
are distributed?
   > 
   > 2. Add a new flag to `Partitioning::Hash`:
   > 
   > ```rust
   >  pub enum Partitioning {
   >      /// Allocate batches using a round-robin algorithm and the specified 
number of partitions
   >      RoundRobinBatch(usize),
   > -    /// Allocate rows based on a hash of one of more expressions and the 
specified number of
   > -    /// partitions
   > -    Hash(Vec<Arc<dyn PhysicalExpr>>, usize),
   > +    /// Allocate rows based on a hash of one of more expressions, the 
specified number of
   > +    /// partitions, and whether the first null in the input should be 
broadcast to all partitions
   > +    Hash(Vec<Arc<dyn PhysicalExpr>>, usize, bool),
   >      /// Unknown partitioning scheme with a known number of partitions
   >      UnknownPartitioning(usize),
   >  }
   > ```
   > 
   > Same question here. Does it make more sense to add a flag to `Hash` 
partitioning, or make a totally separate enum value?
   > 
   
   I'm inclined towards adding a flag to `Hash` partitioning because broadcast 
null value only occurs in hash partitioning, even if the modification of 
partitioning will have a slight impact on downstream.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to