[GitHub] [spark] viirya commented on a change in pull request #33584: [SPARK-36351][SQL] Separate partition filters and data filters in PushDownUtils

GitBox Sun, 01 Aug 2021 01:07:18 -0700


viirya commented on a change in pull request #33584:
URL: https://github.com/apache/spark/pull/33584#discussion_r680469933




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
##########
@@ -57,6 +60,15 @@ abstract class FileScanBuilder(
     StructType(fields)
   }
 
+  def setFilters(pFilters: Seq[Expression], dFilters: Seq[Expression]): Unit = {

Review comment:
       `pFilters` and `dFilters` are poor variable names; readers have to guess
what they mean. I recommend `partitionFilters` and `dataFilters` instead.
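
A minimal sketch of the suggested rename, for illustration only. `Expression` and `FilterHolder` below are hypothetical stand-ins (not Catalyst's `Expression` or the real `FileScanBuilder`), and the method body is an assumption, since the diff above shows only the signature.

```scala
// Hypothetical stand-in for Catalyst's Expression, used only for this sketch.
final case class Expression(sql: String)

// Hypothetical holder class; the real FileScanBuilder is not reproduced here.
class FilterHolder {
  private var partitionFilters: Seq[Expression] = Seq.empty
  private var dataFilters: Seq[Expression] = Seq.empty

  // Same shape as the signature in the diff, with self-describing parameter names.
  def setFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = {
    this.partitionFilters = partitionFilters
    this.dataFilters = dataFilters
  }

  // Accessor added only so the sketch can be inspected.
  def currentFilters: (Seq[Expression], Seq[Expression]) =
    (partitionFilters, dataFilters)
}
```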

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
##########
@@ -57,7 +57,11 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
       // `postScanFilters` and `pushedFilters` can overlap, e.g. the parquet row group filter.
       val (pushedFilters, postScanFiltersWithoutSubquery) = PushDownUtils.pushFilters(
         sHolder.builder, normalizedFiltersWithoutSubquery)
-      val postScanFilters = postScanFiltersWithoutSubquery ++ normalizedFiltersWithSubquery
+      var postScanFilters = postScanFiltersWithoutSubquery ++ normalizedFiltersWithSubquery
+      val partitionFilters = PushDownUtils
+        .pushPartitionFilters(sHolder.builder, sHolder.relation, normalizedFiltersWithoutSubquery)

Review comment:
       Honestly, I don't think it is a good idea to move part of the partition
pruning out of `PruneFileSourcePartitions` and mix it into
`V2ScanRelationPushDown` here.
   
   

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
##########
@@ -24,21 +24,16 @@ import org.apache.spark.sql.catalyst.planning.PhysicalOperation
 import org.apache.spark.sql.catalyst.plans.logical.{Filter, LeafNode, LogicalPlan, Project}
 import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation
 import org.apache.spark.sql.catalyst.rules.Rule
-import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2ScanRelation, FileScan}
 import org.apache.spark.sql.types.StructType
 
 /**
  * Prune the partitions of file source based table using partition filters. Currently, this rule
- * is applied to [[HadoopFsRelation]] with [[CatalogFileIndex]] and [[DataSourceV2ScanRelation]]
- * with [[FileScan]].
+ * is applied to [[HadoopFsRelation]] with [[CatalogFileIndex]]. [[DataSourceV2ScanRelation]]
+ * with [[FileScan]] is pruned in [[PushDownUtils]].

Review comment:
       Conceptually, it is a bit weird to have two rules handling partition
pruning.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala
##########
@@ -74,6 +74,34 @@ object PushDownUtils extends PredicateHelper {
     }
   }
 
+  /**
+   * Pushes down partition filters and data filters to the data source reader
+   *
+   * @return pushed partition filters.
+   */
+  def pushPartitionFilters(

Review comment:
       So you want to move partition pruning for `DataSourceV2ScanRelation` 
from `PruneFileSourcePartitions` to `pushPartitionFilters`?
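
To make the distinction under discussion concrete, here is a rough sketch of separating partition filters from data filters. It assumes a filter counts as a partition filter when every column it references is a partition column. `Predicate`, `FilterSplitSketch`, and `splitFilters` are hypothetical names for this illustration and are not the actual `PushDownUtils` API.

```scala
// Toy stand-in for a Catalyst predicate: its SQL text plus the columns it references.
final case class Predicate(sql: String, references: Set[String])

object FilterSplitSketch {
  // Returns (partitionFilters, dataFilters).
  // Assumption: a predicate is a partition filter only if it references at least
  // one column and all referenced columns are partition columns.
  def splitFilters(
      filters: Seq[Predicate],
      partitionColumns: Set[String]): (Seq[Predicate], Seq[Predicate]) = {
    filters.partition { f =>
      f.references.nonEmpty && f.references.subsetOf(partitionColumns)
    }
  }

  def main(args: Array[String]): Unit = {
    val filters = Seq(
      Predicate("dt = '2021-08-01'", Set("dt")),       // partition column only
      Predicate("id > 10", Set("id")),                 // data column only
      Predicate("dt = region", Set("dt", "region")))   // mixed, stays a data filter
    val (partitionFilters, dataFilters) = splitFilters(filters, Set("dt"))
    println(partitionFilters.map(_.sql))
    println(dataFilters.map(_.sql))
  }
}
```

Whichever rule ends up owning this split, keeping it in a single place would avoid having two rules that both prune partitions.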

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
##########
@@ -57,6 +60,15 @@ abstract class FileScanBuilder(
     StructType(fields)
   }
 
+  def setFilters(pFilters: Seq[Expression], dFilters: Seq[Expression]): Unit = {

Review comment:
       And +1 for keeping the `withFilters` method name.



