xiaoli created SPARK-35190:
------------------------------
Summary: all columns are read even if column pruning applies when
spark3.0 reads a table written by spark2.2
Key: SPARK-35190
URL: https://issues.apache.org/jira/browse/SPARK-35190
Project: Spark
Issue Type: Question
Components: Spark Core
Affects Versions: 3.0.0
Environment: spark3.0
set spark.sql.hive.convertMetastoreOrc=true (default value in spark3.0)
set spark.sql.orc.impl=native (default value in spark3.0)
Reporter: xiaoli
Before I describe this issue, some background: the Spark version we currently
use is 2.2, and we plan to migrate to spark3.0 in the near future. Before
migrating, we ran some queries in both spark2.2 and spark3.0 to check for
potential issues. The source tables of these queries are in ORC format,
written by spark2.2.
I found that even when column pruning is applied, spark3.0's native reader
reads all columns (a rough way to verify this is sketched below).
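One way to see the symptom from the driver, by summing task input bytes for a
pruned query and comparing against the on-disk table size (table and column
names below are made up):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sum bytesRead across tasks; listener events are delivered asynchronously,
// so give the bus a moment (or check the Spark UI) before reading the total.
var bytesRead = 0L
val listener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    Option(taskEnd.taskMetrics).foreach(m => bytesRead += m.inputMetrics.bytesRead)
  }
}
spark.sparkContext.addSparkListener(listener)
spark.sql("SELECT one_col FROM mydb.t_orc_22").collect()
spark.sparkContext.removeSparkListener(listener)
println(s"bytes read: $bytesRead") // close to the full table size when pruning is skipped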
Then I did some remote debugging. In OrcUtils.scala's requestedColumnIds
method, it checks whether every field name starts with "_col". In my case the
field names do start with "_col" (e.g. "_col1", "_col2"), so pruneCols is not
done. The code is below:
if (orcFieldNames.forall(_.startsWith("_col"))) {
  // This is a ORC file written by Hive, no field names in the physical schema, assume the
  // physical schema maps to the data scheme by index.
  assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
    s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
    "no idea which columns were dropped, fail to read.")
  // for ORC file written by Hive, no field names
  // in the physical schema, there is a need to send the
  // entire dataSchema instead of required schema.
  // So pruneCols is not done in this case
  Some(requiredSchema.fieldNames.map { name =>
    val index = dataSchema.fieldIndex(name)
    if (index < orcFieldNames.length) {
      index
    } else {
      -1
    }
  }, false)
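To double-check that the physical field names really are positional, one can
read an ORC file's footer schema directly with the ORC reader API (the file
path below is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Point this at one data file of the spark2.2-written table.
val reader = OrcFile.createReader(
  new Path("hdfs://.../warehouse/mydb.db/t_orc_22/part-00000"),
  OrcFile.readerOptions(new Configuration()))
// Output written through the Hive ORC SerDe (as spark2.2 did by default)
// prints as struct<_col0:int,_col1:string,...>, while spark3.0's native
// writer keeps the real column names.
println(reader.getSchema)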
Although the code comment explains the reason, I still do not understand it.
This issue only happens in one case: spark3.0 uses the native reader to read a
table written by spark2.2.
In other cases there is no such issue. I did two more tests:
Test1: use spark3.0's Hive reader (running with
spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read
the same table; it only reads the pruned columns.
Test2: use spark3.0 to write a table, then use spark3.0's native reader to
read this new table; it only reads the pruned columns (a minimal sketch of
this test follows).
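Roughly what Test2 looks like (table and column names are made up): spark3.0
writes real field names into the ORC footer, so requestedColumnIds can match
the required columns by name.

// Write a small ORC table with spark3.0, then read one column back.
spark.range(10).selectExpr("id AS a", "id * 2 AS b", "id * 3 AS c")
  .write.format("orc").saveAsTable("t_orc_30")
spark.sql("SELECT a FROM t_orc_30").explain()
// The FileScan node reports ReadSchema: struct<a:bigint>, i.e. only the pruned column.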
This issue blocks us from using the native reader in spark3.0. Does anyone
know the underlying reason, or can anyone suggest a solution?