[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3208: Add SQL query planner support for DISTRIBUTE BY

GitBox Sat, 20 Aug 2022 08:36:45 -0700


alamb commented on code in PR #3208:
URL: https://github.com/apache/arrow-datafusion/pull/3208#discussion_r950708674



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -1390,8 +1399,9 @@ pub enum Partitioning {
     RoundRobinBatch(usize),
     /// Allocate rows based on a hash of one of more expressions and the 
specified number
     /// of partitions.
-    /// This partitioning scheme is not yet fully supported. See 
<https://issues.apache.org/jira/browse/ARROW-11011>
     Hash(Vec<Expr>, usize),
+    /// The DISTRIBUTE BY clause is used to repartition the data based on the 
input expressions
+    DistributeBy(Vec<Expr>),

Review Comment:
   Given the description in the spark docs 
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-distribute-by.html
 I think either hashing or not would produce the same behavior with respect to 
partitioning.
   
   However, using just the expressions as you have in this PR is probably more 
performant -- hashing the values would be wasted work. 



##########
datafusion/core/src/physical_plan/planner.rs:
##########
@@ -729,6 +729,9 @@ impl DefaultPhysicalPlanner {
                                 .collect::<Result<Vec<_>>>()?;
                             Partitioning::Hash(runtime_expr, *n)
                         }
+                        LogicalPartitioning::DistributeBy(_) => {
+                            return 
Err(DataFusionError::NotImplemented("Physical plan does not support 
DistributeBy partitioning".to_string()))

Review Comment:
   Thanks for breaking up these projects into smaller, easier to review chunks 
👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3208: Add SQL query planner support for DISTRIBUTE BY

Reply via email to