Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/16578#discussion_r142140412
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds to a
+ * SQL column, and a nested Parquet column corresponds to a [[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+ override def apply(plan: LogicalPlan): LogicalPlan =
+ plan transformDown {
+ case op @ PhysicalOperation(projects, filters,
+ l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, partitionSchema,
+ dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _)) =>
+ val projectionFields = projects.flatMap(getFields)
+ val filterFields = filters.flatMap(getFields)
+ val requestedFields = (projectionFields ++ filterFields).distinct
+
+ // If [[requestedFields]] includes a proper field, continue. Otherwise,
+ // return [[op]]
+ if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) {
+ val prunedSchema = requestedFields
+ .map { case (field, _) => StructType(Array(field)) }
+ .reduceLeft(_ merge _)
--- End diff --
I've updated the logic for comparing the original schema to the pruned
schema:
https://github.com/apache/spark/blob/52fddc181f32726cea1dd12a23ebf7201986be01/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala#L53-L57
The following test validates that selection order is irrelevant when
comparing the pruned schema to the original schema:
https://github.com/apache/spark/blob/52fddc181f32726cea1dd12a23ebf7201986be01/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L82-L107
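The idea of an order-insensitive schema comparison can be sketched with a toy
schema type (a hedged illustration only: `Schema`, `Leaf`, `Struct`,
`leafPaths`, and `sameColumns` are hypothetical stand-ins, not Spark's
`StructType` API or the code in this PR):

```scala
// Toy stand-in for a nested SQL schema.
sealed trait Schema
case class Leaf(name: String) extends Schema
case class Struct(name: String, children: Seq[Schema]) extends Schema

// Collect every leaf-column path, e.g. "name.first".
def leafPaths(schema: Schema, prefix: String = ""): Set[String] = schema match {
  case Leaf(n) =>
    Set(if (prefix.isEmpty) n else s"$prefix.$n")
  case Struct(n, children) =>
    val p = if (prefix.isEmpty) n else s"$prefix.$n"
    children.flatMap(c => leafPaths(c, p)).toSet
}

// Two schemas expose the same columns iff their leaf-path sets are equal,
// regardless of the order in which fields were selected.
def sameColumns(a: Schema, b: Schema): Boolean =
  leafPaths(a) == leafPaths(b)
```

Because the comparison is over a set of paths rather than an ordered field
list, selecting `name.last` before `name.first` yields a schema equivalent to
the original, which is the property the linked test exercises.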
---