Jarred Li created SPARK-32582:
---------------------------------
Summary: Spark SQL Infer Schema Performance
Key: SPARK-32582
URL: https://issues.apache.org/jira/browse/SPARK-32582
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0, 2.4.6
Reporter: Jarred Li
When schema inference is enabled, Spark lists all the files in the table,
but only one of the files is used to read schema information. Listing every
file hurts performance when the table has a large number of partitions.
See the code at
"[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]":
all the files in the table are passed in, but the schema of only one of them is
used for inference.
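
To illustrate the cost being described, here is a minimal Python sketch. It is
not Spark code: the function names (iter_files, infer_schema_eager,
infer_schema_lazy) and the read_schema callback are hypothetical, and a local
directory tree stands in for a partitioned table. The eager version lists
every file before reading one schema; the lazy version stops at the first file
that yields a schema.

```python
import os
from typing import Callable, Iterator, Optional, Tuple

# Hypothetical callback: returns a schema for a file, or None if unreadable.
SchemaReader = Callable[[str], Optional[str]]


def iter_files(root: str) -> Iterator[str]:
    """Yield file paths under root lazily, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort keeps traversal order deterministic
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def infer_schema_eager(root: str, read_schema: SchemaReader) -> Tuple[Optional[str], int]:
    """List every file first, then use the first readable one (the pattern reported)."""
    files = list(iter_files(root))  # full listing, cost grows with partition count
    for path in files:
        schema = read_schema(path)
        if schema is not None:
            return schema, len(files)
    return None, len(files)


def infer_schema_lazy(root: str, read_schema: SchemaReader) -> Tuple[Optional[str], int]:
    """Stop listing as soon as one file yields a schema (a possible alternative)."""
    touched = 0
    for path in iter_files(root):
        touched += 1
        schema = read_schema(path)
        if schema is not None:
            return schema, touched
    return None, touched
```

Both functions return the schema together with the number of files that were
listed, so the difference is easy to compare. Whether an early exit like this
is safe in Spark depends on details not covered here, such as how files with
differing schemas should be handled.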