[GitHub] [spark] cloud-fan commented on a change in pull request #25919: [WIP][SPARK-15616][SQL] Hive table supports partition pruning in JoinSelection

GitBox Thu, 24 Oct 2019 08:22:35 -0700

cloud-fan commented on a change in pull request #25919: [WIP][SPARK-15616][SQL] 
Hive table supports partition pruning in JoinSelection
URL: https://github.com/apache/spark/pull/25919#discussion_r338637820


 ##########
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
 ##########
 @@ -231,6 +232,68 @@ case class RelationConversions(
   }
 }
 
+/**
+ * TODO: merge this with PruneFileSourcePartitions after we completely make 
hive as a data source.
+ */
+case class PruneHiveTablePartitions(
+  session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    case filter @ Filter(condition, relation: HiveTableRelation) if 
relation.isPartitioned =>
+      val predicates = splitConjunctivePredicates(condition)
+      val normalizedFilters = predicates.map { e =>
+        e transform {
+          case a: AttributeReference =>
+            a.withName(relation.output.find(_.semanticEquals(a)).get.name)
+        }
+      }
+      val partitionSet = AttributeSet(relation.partitionCols)
+      // SPARK-24085: remove scalar subquery in partition expression then get 
normalized predicates
+      val pruningPredicates = normalizedFilters
+        .filterNot(SubqueryExpression.hasSubquery)
+        .filter { predicate =>
+          !predicate.references.isEmpty && 
predicate.references.subsetOf(partitionSet)
+        }
+      val conf = session.sessionState.conf
+      if (conf.metastorePartitionPruning && pruningPredicates.nonEmpty) {
+        val prunedPartitions = 
session.sharedState.externalCatalog.listPartitionsByFilter(
+          relation.tableMeta.database,
+          relation.tableMeta.identifier.table,
+          pruningPredicates,
+          conf.sessionLocalTimeZone)
+        val sizeInBytes = try {
+          val sizeOfPartitions = prunedPartitions.map { part =>
+            val rawDataSize = 
part.parameters.get(StatsSetupConst.RAW_DATA_SIZE).map(_.toLong)
+            val totalSize = 
part.parameters.get(StatsSetupConst.TOTAL_SIZE).map(_.toLong)
+            if (rawDataSize.isDefined && rawDataSize.get > 0) {
+              rawDataSize.get
+            } else if (totalSize.isDefined && totalSize.get > 0L) {
+              totalSize.get
+            } else if (conf.fallBackToHdfsForStatsEnabled) {
+              CommandUtils.calculateLocationSize(
+                session.sessionState, relation.tableMeta.identifier, 
part.storage.locationUri)
+            } else { // we cannot get any size statics here. Use 0 as the 
default size to sum up.
+              0L
+            }
+          }.sum
+          // If size of partitions is zero fall back to the default size.
+          if (sizeOfPartitions == 0L) conf.defaultSizeInBytes else 
sizeOfPartitions
+        } catch {
+          case e: IOException =>
+            logWarning("Failed to get table size from HDFS.", e)
+            conf.defaultSizeInBytes
+        }
+        val withStats = relation.tableMeta.copy(
+          stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
+        val prunedHiveTableRelation = relation.copy(tableMeta = withStats,
+          normalizedFilters = pruningPredicates, prunedPartitions = 
prunedPartitions)
 
 Review comment:
   Why do we need to keep `pruningPredicates`? IIUC the approach should be very 
simply:
   1. this rule only changes `HiveTableRelation` to hold an optional partition 
list.
   2. the `HiveTableScanExec` will get the partition list from 
`HiveTableRelation` or call `listPartitionsByFilter`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] cloud-fan commented on a change in pull request #25919: [WIP][SPARK-15616][SQL] Hive table supports partition pruning in JoinSelection

Reply via email to