advancedxy commented on a change in pull request #25919: [SPARK-15616][SQL] Hive table supports partition pruning in JoinSelection
URL: https://github.com/apache/spark/pull/25919#discussion_r329346535
 
 

 ##########
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
 ##########
 @@ -232,6 +234,63 @@ case class RelationConversions(
   }
 }
 
+/**
+ *
+ * TODO: merge this with PruneFileSourcePartitions after we completely make hive as a data source.
+ */
+case class PruneHiveTablePartitions(
+  session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    case filter @ Filter(condition, relation: HiveTableRelation) if relation.isPartitioned =>
+      val predicates = splitConjunctivePredicates(condition)
+      val normalizedFilters = predicates.map { e =>
+        e transform {
+          case a: AttributeReference =>
+            a.withName(relation.output.find(_.semanticEquals(a)).get.name)
+        }
+      }
+      val partitionSet = AttributeSet(relation.partitionCols)
+      val pruningPredicates = normalizedFilters.filter { predicate =>
+        !predicate.references.isEmpty &&
+          predicate.references.subsetOf(partitionSet)
+      }
+      val conf = session.sessionState.conf
+      if (pruningPredicates.nonEmpty && conf.fallBackToHdfsForStatsEnabled &&
+        conf.metastorePartitionPruning) {
+        val prunedPartitions = session.sharedState.externalCatalog.listPartitionsByFilter(
 
 Review comment:
   > How about we keep the listed partitions in HiveTableRelation?
   
   This is a good one. However, we may have to add two fields: `pruningFilters: 
Seq[Expression]` and `prunedPartitions: Seq[CatalogTablePartition]`, and I 
believe they would complicate `HiveTableRelation`. Another thing is that 
`HiveTableRelation` might be copied multiple times, so we may lose the 
`prunedPartitions` field by accident if it is not copied carefully. 
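   
   To illustrate the copy concern, here is a minimal sketch. It uses simplified 
stand-in types (`Relation`, `Partition`, and string filters are hypothetical, 
not the real `HiveTableRelation` or `CatalogTablePartition`), and it assumes the 
two new fields would get default values:

```scala
// Simplified stand-ins for illustration only; NOT the real Spark classes.
final case class Partition(spec: Map[String, String])

final case class Relation(
    tableName: String,
    partitionCols: Seq[String],
    // Hypothetical extra fields, assumed to default to "nothing pruned yet".
    pruningFilters: Seq[String] = Nil,
    prunedPartitions: Option[Seq[Partition]] = None)

object CopyPitfall {
  // A relation whose partitions have already been listed and pruned.
  val pruned: Relation = Relation("t", Seq("dt")).copy(
    pruningFilters = Seq("dt = '2019-01-01'"),
    prunedPartitions = Some(Seq(Partition(Map("dt" -> "2019-01-01")))))

  // copy() carries over every field that is not explicitly overridden.
  val renamed: Relation = pruned.copy(tableName = "t_alias")

  // Rebuilding through the constructor silently falls back to the defaults,
  // so the pruned partitions are lost without any compile-time warning.
  val rebuilt: Relation = Relation(pruned.tableName, pruned.partitionCols)
}
```

   In this sketch `renamed` keeps the pruned partitions while `rebuilt` does not, 
which is the kind of accidental loss I am worried about.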
   
   If that's not a problem, storing the listed partitions in `HiveTableRelation` 
is a good choice. 
   WDYT @cloud-fan?

