[GitHub] [spark] AngersZhuuuu commented on a change in pull request #28805: [SPARK-28169][SQL] Convert scan predicate condition to CNF

GitBox Tue, 23 Jun 2020 08:19:25 -0700


AngersZhuuuu commented on a change in pull request #28805:
URL: https://github.com/apache/spark/pull/28805#discussion_r444305821




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
##########
@@ -87,8 +88,17 @@ private[sql] object PruneFileSourcePartitions extends Rule[LogicalPlan] {
             _,
             _))
         if filters.nonEmpty && fsRelation.partitionSchemaOption.isDefined =>
-      val (partitionKeyFilters, _) = getPartitionKeyFiltersAndDataFilters(
-        fsRelation.sparkSession, logicalRelation, partitionSchema, filters, logicalRelation.output)
+      val predicates = conjunctiveNormalFormAndGroupExpsByReference(filters.reduceLeft(And))

Review comment:
   > I still don't see the rationale. What if we don't do the group by and simply apply the CNF conversion?
   
   Consider a case like this:
   ```
   TBL: test
   PARTITION COLS : dt
   
   SELECT * FROM test WHERE (dt = 2 and id < 100 and id > 20) or dt = 3
   ```
   
   If we don't group by reference, the condition `(dt = 2 and id < 100 and id > 20) or dt = 3` will be converted to
   ```
   (dt = 3 or dt = 2) and (dt = 3 or id < 100) and (dt = 3 or id > 20)
   ```
   But we know that only `(dt = 3 or dt = 2)` can be used as a partition pruning predicate. By grouping by reference we can combine `id < 100` and `id > 20`, and return
   ```
   (dt = 3 or dt = 2) and (dt = 3 or (id < 100 and  id > 20))
   ```
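
The two conversions above can be sketched on a toy expression tree. This is only an illustration under stated assumptions: `Pred`, `cnf`, `groupByRefs` and `cnfGrouped` are hypothetical names, the classes are stand-ins for Catalyst expressions, and the grouping logic is a guess at the idea, not the code in this PR.

```scala
// Toy expression tree (hypothetical; not Spark's Catalyst classes).
sealed trait Expr { def refs: Set[String] }
case class Pred(col: String, text: String) extends Expr {
  def refs: Set[String] = Set(col)
  override def toString: String = text
}
case class And(l: Expr, r: Expr) extends Expr {
  def refs: Set[String] = l.refs ++ r.refs
  override def toString: String = s"($l and $r)"
}
case class Or(l: Expr, r: Expr) extends Expr {
  def refs: Set[String] = l.refs ++ r.refs
  override def toString: String = s"($l or $r)"
}

object CnfSketch {
  // Plain CNF: returns the conjuncts, distributing `or` over `and`.
  def cnf(e: Expr): Seq[Expr] = e match {
    case And(l, r) => cnf(l) ++ cnf(r)
    case Or(l, r)  => for (x <- cnf(l); y <- cnf(r)) yield Or(x, y)
    case other     => Seq(other)
  }

  // Assumed grouping step: conjuncts that reference the same columns are
  // and-ed back together, keeping first-seen order.
  def groupByRefs(cs: Seq[Expr]): Seq[Expr] =
    cs.map(_.refs).distinct.map { r =>
      cs.filter(_.refs == r).reduceLeft[Expr](And(_, _))
    }

  // CNF with grouping applied to the conjuncts of every `and`.
  def cnfGrouped(e: Expr): Seq[Expr] = e match {
    case And(l, r) => groupByRefs(cnfGrouped(l) ++ cnfGrouped(r))
    case Or(l, r)  => for (x <- cnfGrouped(l); y <- cnfGrouped(r)) yield Or(x, y)
    case other     => Seq(other)
  }

  // (dt = 2 and id < 100 and id > 20) or dt = 3
  val example: Expr = Or(
    And(And(Pred("dt", "dt = 2"), Pred("id", "id < 100")), Pred("id", "id > 20")),
    Pred("dt", "dt = 3"))

  def main(args: Array[String]): Unit = {
    println(cnf(example).mkString(" and "))        // three conjuncts
    println(cnfGrouped(example).mkString(" and ")) // two conjuncts
  }
}
```

In this sketch the plain conversion yields three conjuncts and the grouped one yields two; the operand order inside each `or` differs from the expressions written above, but the predicates are equivalent.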
    
   In other words, the final partition pruning strategy keeps a predicate as a partition filter only if its references are a subset of partCols. When we combine conditions grouped by reference, the combined condition still sits inside an `or` expression, and `or` can't be split by `splitConjunctivePredicates`, so the grouping won't change the final push-down result.
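
A small sketch of that argument, again on a toy expression tree rather than Catalyst: `splitConjuncts` imitates the idea behind `splitConjunctivePredicates` (split on top-level `and` only), and `partitionFilters` is a hypothetical helper that keeps the conjuncts whose references are a subset of the partition columns. All names here are assumptions for illustration.

```scala
// Toy expression tree (hypothetical; not Spark's Catalyst classes).
sealed trait Expr { def refs: Set[String] }
case class Pred(col: String, text: String) extends Expr {
  def refs: Set[String] = Set(col)
}
case class And(l: Expr, r: Expr) extends Expr {
  def refs: Set[String] = l.refs ++ r.refs
}
case class Or(l: Expr, r: Expr) extends Expr {
  def refs: Set[String] = l.refs ++ r.refs
}

object PruningSketch {
  // Split on top-level `and` only; an `or` is never taken apart.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Keep only conjuncts that reference partition columns exclusively.
  def partitionFilters(e: Expr, partCols: Set[String]): Seq[Expr] =
    splitConjuncts(e).filter(_.refs.subsetOf(partCols))

  val dt2: Expr  = Pred("dt", "dt = 2")
  val dt3: Expr  = Pred("dt", "dt = 3")
  val idLt: Expr = Pred("id", "id < 100")
  val idGt: Expr = Pred("id", "id > 20")

  // (dt = 3 or dt = 2) and (dt = 3 or id < 100) and (dt = 3 or id > 20)
  val plain: Expr = And(And(Or(dt3, dt2), Or(dt3, idLt)), Or(dt3, idGt))
  // (dt = 3 or dt = 2) and (dt = 3 or (id < 100 and id > 20))
  val grouped: Expr = And(Or(dt3, dt2), Or(dt3, And(idLt, idGt)))

  def main(args: Array[String]): Unit = {
    // Both forms leave the same single partition filter: dt = 3 or dt = 2.
    println(partitionFilters(plain, Set("dt")))
    println(partitionFilters(grouped, Set("dt")))
  }
}
```

Both forms end up with the same partition filter in this sketch, because every conjunct that mentions `id` also references a non-partition column and is dropped either way.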
   
   
   It's OK to simply apply the CNF rule, but grouping by reference avoids generating unnecessary expressions and so keeps the length of the final generated expressions under control.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



