Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/16578#discussion_r142140412
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds to a
+ * SQL column, and a nested Parquet column corresponds to a [[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+ override def apply(plan: LogicalPlan): LogicalPlan =
+ plan transformDown {
+ case op @ PhysicalOperation(projects, filters,
+ l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, partitionSchema,
+ dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _)) =>
+ val projectionFields = projects.flatMap(getFields)
+ val filterFields = filters.flatMap(getFields)
+ val requestedFields = (projectionFields ++ filterFields).distinct
+
+ // If [[requestedFields]] includes a proper field, continue. Otherwise,
+ // return [[op]]
+ if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) {
+ val prunedSchema = requestedFields
+ .map { case (field, _) => StructType(Array(field)) }
+ .reduceLeft(_ merge _)
--- End diff --
I've updated the logic for comparing the original schema to the pruned
schema:
https://github.com/apache/spark/blob/52fddc181f32726cea1dd12a23ebf7201986be01/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala#L53-L57
The following test validates that selection order is irrelevant when
comparing the pruned schema to the original schema:
https://github.com/apache/spark/blob/52fddc181f32726cea1dd12a23ebf7201986be01/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L82-L107
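The idea of an order-insensitive schema comparison can be sketched with a toy
schema type (a hedged illustration only: `Schema`, `Leaf`, `Struct`,
`leafPaths`, and `sameColumns` are hypothetical stand-ins, not Spark's
`StructType` API or the code in this PR):

```scala
// Toy stand-in for a nested SQL schema.
sealed trait Schema
case class Leaf(name: String) extends Schema
case class Struct(name: String, children: Seq[Schema]) extends Schema

// Collect every leaf-column path, e.g. "name.first".
def leafPaths(schema: Schema, prefix: String = ""): Set[String] = schema match {
  case Leaf(n) =>
    Set(if (prefix.isEmpty) n else s"$prefix.$n")
  case Struct(n, children) =>
    val p = if (prefix.isEmpty) n else s"$prefix.$n"
    children.flatMap(c => leafPaths(c, p)).toSet
}

// Two schemas expose the same columns iff their leaf-path sets are equal,
// regardless of the order in which fields were selected.
def sameColumns(a: Schema, b: Schema): Boolean =
  leafPaths(a) == leafPaths(b)
```

Because the comparison is over a set of paths rather than an ordered field
list, selecting `name.last` before `name.first` yields a schema equivalent to
the original, which is the property the linked test exercises.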
---