[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

Jarred Li (Jira) Sun, 16 Aug 2020 01:41:08 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178445#comment-17178445
 ]


Jarred Li commented on SPARK-32582:
-----------------------------------

??I am not sure it would be helpful since there is no API in Hadoop to list 
partial files in a folder.??



We don't need to list every partition in the table. "Sample" here means 
listing only a subset of the partitions rather than all of them. Within each 
sampled partition, we can still list all the files in that folder. 
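
To make the proposal concrete, here is a rough sketch of the idea in plain Python, not Spark code. It assumes the partition directories are already known without listing any files (for example, from the metastore). The names sample_partition_files, infer_schema, and the read_schema callback are hypothetical and exist only for illustration.

```python
import random
from pathlib import Path
from typing import Callable, List, Optional, Sequence


def sample_partition_files(
    partition_dirs: Sequence[Path],
    sample_size: int,
    seed: Optional[int] = None,
) -> List[Path]:
    """List the files of a random sample of partitions only.

    Partitions outside the sample are never listed, which is where the
    saving over a full table listing would come from.
    """
    rng = random.Random(seed)
    count = min(sample_size, len(partition_dirs))
    chosen = rng.sample(list(partition_dirs), count)

    files: List[Path] = []
    for partition in chosen:
        # Inside a sampled partition, list every file in the folder,
        # skipping marker/hidden files such as _SUCCESS.
        files.extend(
            sorted(
                p
                for p in Path(partition).iterdir()
                if p.is_file() and not p.name.startswith(("_", "."))
            )
        )
    return files


def infer_schema(
    partition_dirs: Sequence[Path],
    read_schema: Callable[[Path], Optional[str]],
    sample_size: int = 3,
    seed: Optional[int] = None,
) -> Optional[str]:
    """Return the schema of the first sampled file that yields one.

    read_schema is a hypothetical callback that returns None when a file
    has no usable schema (for example, an empty or corrupt file).
    """
    for data_file in sample_partition_files(partition_dirs, sample_size, seed):
        schema = read_schema(data_file)
        if schema is not None:
            return schema
    return None
```

With a sample size of k, at most k partition folders are listed regardless of how many partitions the table has. The trade-off is that a schema found in the sample may not reflect files in partitions that were never looked at.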

 

> Spark SQL Infer Schema Performance
> ----------------------------------
>
>                 Key: SPARK-32582
>                 URL: https://issues.apache.org/jira/browse/SPARK-32582
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Jarred Li
>            Priority: Major
>
> When schema inference is enabled, Spark tries to list all the files in the 
> table, but only one of the files is used to read the schema information. 
> Listing every file in the table hurts performance when the number of 
> partitions is large.
>  
> See the code at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88: 
> all the files in the table are passed as input, but only one file's schema 
> is used to infer the schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

