[jira] [Created] (SPARK-32582) Spark SQL Infer Schema Performance

Jarred Li (Jira) Mon, 10 Aug 2020 06:55:54 -0700

Jarred Li created SPARK-32582:
---------------------------------

             Summary: Spark SQL Infer Schema Performance
                 Key: SPARK-32582
                 URL: https://issues.apache.org/jira/browse/SPARK-32582
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0, 2.4.6
            Reporter: Jarred Li



When schema inference is enabled, Spark lists all the files in the table, 
but only one of the files is used to read the schema information. Listing 
every file hurts performance, and the cost grows with the number of 
partitions in the table.

 

See the code in 
"[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]":
all the files in the table are passed in as input, but only one file's schema is 
used to infer the schema.
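
To illustrate the cost difference, here is a minimal sketch in plain Python. It is not Spark's implementation. The helpers (make_table, walk_files, read_schema, infer_schema_eager, infer_schema_lazy), the dt=NNNN directory layout, and the placeholder file contents are all assumptions made for illustration. It compares listing the whole table before reading one schema against stopping the listing once one schema has been read.

```python
import os
import tempfile


def make_table(root, partitions):
    """Create a fake partitioned table: one small file per partition directory."""
    for i in range(partitions):
        part_dir = os.path.join(root, f"dt={i:04d}")
        os.makedirs(part_dir)
        with open(os.path.join(part_dir, "part-00000.orc"), "w") as fh:
            fh.write("struct<id:int>")  # placeholder text, not a real ORC file


def walk_files(root, visited):
    """Lazily yield file paths, recording each directory that gets listed."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        visited.append(dirpath)
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def read_schema(path):
    """Hypothetical stand-in for a footer read; None means no usable schema."""
    with open(path) as fh:
        return fh.read() or None


def infer_schema_eager(root, visited):
    """List every file first, then use the first schema that can be read."""
    files = list(walk_files(root, visited))
    return next((s for s in map(read_schema, files) if s is not None), None)


def infer_schema_lazy(root, visited):
    """Stop listing as soon as one file yields a schema."""
    schemas = map(read_schema, walk_files(root, visited))
    return next((s for s in schemas if s is not None), None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_table(root, 200)
        eager_seen, lazy_seen = [], []
        infer_schema_eager(root, eager_seen)
        infer_schema_lazy(root, lazy_seen)
        print("directories listed (eager):", len(eager_seen))
        print("directories listed (lazy):", len(lazy_seen))
```

In this sketch both variants return the same schema, but the eager one lists every partition directory while the lazy one lists only the table root and the first partition. Whether Spark can stop early in the same way depends on how the file index is built, which this example does not model.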

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

