xiaoli created SPARK-35190:
------------------------------
Summary: all columns are read even if column pruning applies when
spark3.0 reads a table written by spark2.2
Key: SPARK-35190
URL: https://issues.apache.org/jira/browse/SPARK-35190
Project: Spark
Issue Type: Question
Components: Spark Core
Affects Versions: 3.0.0
Environment: spark3.0
set spark.sql.hive.convertMetastoreOrc=true (default value in spark3.0)
set spark.sql.orc.impl=native (default value in spark3.0)
Reporter: xiaoli
Before I describe this issue, some background: the Spark version we currently
use is 2.2, and we plan to migrate to spark3.0 in the near future. Before
migrating, we ran some queries in both spark2.2 and spark3.0 to check for
potential issues. The source tables of these queries are in ORC format,
written by spark2.2.
I found that even when column pruning is applied, spark3.0's native reader
reads all columns (a rough way to verify this is sketched below).
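One way to see the symptom from the driver, by summing task input bytes for a
pruned query and comparing against the on-disk table size (table and column
names below are made up):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sum bytesRead across tasks; listener events are delivered asynchronously,
// so give the bus a moment (or check the Spark UI) before reading the total.
var bytesRead = 0L
val listener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    Option(taskEnd.taskMetrics).foreach(m => bytesRead += m.inputMetrics.bytesRead)
  }
}
spark.sparkContext.addSparkListener(listener)
spark.sql("SELECT one_col FROM mydb.t_orc_22").collect()
spark.sparkContext.removeSparkListener(listener)
println(s"bytes read: $bytesRead") // close to the full table size when pruning is skipped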
Then I did some remote debugging. In OrcUtils.scala's requestedColumnIds
method, it checks whether every field name starts with "_col". In my case the
field names do start with "_col" (e.g. "_col1", "_col2"), so pruneCols is not
done. The code is below:
if (orcFieldNames.forall(_.startsWith("_col"))) {
  // This is a ORC file written by Hive, no field names in the physical schema, assume the
  // physical schema maps to the data scheme by index.
  assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
    s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
    "no idea which columns were dropped, fail to read.")
  // for ORC file written by Hive, no field names
  // in the physical schema, there is a need to send the
  // entire dataSchema instead of required schema.
  // So pruneCols is not done in this case
  Some(requiredSchema.fieldNames.map { name =>
    val index = dataSchema.fieldIndex(name)
    if (index < orcFieldNames.length) {
      index
    } else {
      -1
    }
  }, false)
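To double-check that the physical field names really are positional, one can
read an ORC file's footer schema directly with the ORC reader API (the file
path below is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Point this at one data file of the spark2.2-written table.
val reader = OrcFile.createReader(
  new Path("hdfs://.../warehouse/mydb.db/t_orc_22/part-00000"),
  OrcFile.readerOptions(new Configuration()))
// Output written through the Hive ORC SerDe (as spark2.2 did by default)
// prints as struct<_col0:int,_col1:string,...>, while spark3.0's native
// writer keeps the real column names.
println(reader.getSchema)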
Although the code comment explains the reason, I still do not understand it.
This issue only happens in one case: spark3.0 uses the native reader to read a
table written by spark2.2.
In other cases there is no such issue. I did two more tests:
Test1: use spark3.0's Hive reader (running with
spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read
the same table; it only reads the pruned columns.
Test2: use spark3.0 to write a table, then use spark3.0's native reader to
read this new table; it only reads the pruned columns (a minimal sketch of
this test follows).
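Roughly what Test2 looks like (table and column names are made up): spark3.0
writes real field names into the ORC footer, so requestedColumnIds can match
the required columns by name.

// Write a small ORC table with spark3.0, then read one column back.
spark.range(10).selectExpr("id AS a", "id * 2 AS b", "id * 3 AS c")
  .write.format("orc").saveAsTable("t_orc_30")
spark.sql("SELECT a FROM t_orc_30").explain()
// The FileScan node reports ReadSchema: struct<a:bigint>, i.e. only the pruned column.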
This issue blocks us from using the native reader in spark3.0. Does anyone
know the underlying reason, or can anyone suggest a solution?