[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

Jarred Li (Jira) Tue, 11 Aug 2020 03:05:09 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175428#comment-17175428
 ]


Jarred Li commented on SPARK-32582:
-----------------------------------

I think this is a limitation of ORC schema inference. "fileIndex.listFiles" 
lists all the files in the table, while the ORC schema inference code only 
uses one file to get the file schema 
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L96]).

 

If ORC does not support mergeSchema, we should add one more method that reads 
the schema from a single file, so that it does not need to list all files. Of 
course, this is not a long-term solution. 
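
As a rough sketch of the "stop at the first file" idea (not Spark code): the helper below walks a table directory lazily and returns the first data file, without building the full listing. The names FirstFileFinder and firstDataFile are hypothetical, and it assumes a local filesystem via java.nio rather than the Hadoop FileSystem API that Spark actually uses. Skipping names that start with "_" or "." is also an assumption about which files count as data files.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Spark: finds one data file under a table
// root by walking the directory tree lazily and stopping at the first match.
object FirstFileFinder {
  def firstDataFile(tableRoot: Path): Option[Path] = {
    // Files.walk is lazy, so findFirst() short-circuits the traversal.
    val stream = Files.walk(tableRoot)
    try {
      val found = stream
        .filter(p => Files.isRegularFile(p))
        .filter { p =>
          // Assumption: marker/hidden files such as _SUCCESS are not data files.
          val name = p.getFileName.toString
          !name.startsWith("_") && !name.startsWith(".")
        }
        .findFirst()
      if (found.isPresent) Some(found.get) else None
    } finally stream.close()
  }
}
```

The schema would then be read from that single file only. This is valid only when all files are known to share one schema, which is why it cannot be the long-term answer.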

 

For other file formats, for example Parquet, it is time-consuming to iterate 
over all the files for schema inference. I am wondering whether we should add 
a parameter to sample the files for schema inference to improve performance. 
We can add one more schema inference mode for HIVE_CASE_SENSITIVE_INFERENCE to 
support sampling. Currently, there are 3 modes: 

INFER_AND_SAVE, INFER_ONLY, NEVER_INFER

We can add one more mode, "INFER_WITH_SAMPLE". By controlling the sample 
percentage, we can control how many files are read for schema inference. 
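
To make the sampling idea concrete, here is a minimal sketch of how a file list could be reduced before schema inference. SchemaSampling, sampleFiles, the fraction parameter and the fixed seed are all hypothetical names and choices for illustration. None of them exist in Spark, and "INFER_WITH_SAMPLE" itself is only a proposal.

```scala
import scala.util.Random

// Hypothetical helper, not part of Spark: picks a random subset of files
// whose size is controlled by a sample fraction.
object SchemaSampling {
  /** Returns about `fraction` of `files`, and at least one file when the input is non-empty. */
  def sampleFiles[A](files: Seq[A], fraction: Double, seed: Long = 42L): Seq[A] = {
    require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")
    if (files.isEmpty) Seq.empty
    else {
      // Round up so that a small fraction still yields one file to read.
      val n = math.max(1, math.ceil(files.size * fraction).toInt)
      // A fixed seed keeps the chosen sample reproducible between runs.
      new Random(seed).shuffle(files).take(n)
    }
  }
}
```

In this sketch the sampled list would be passed to inferSchema in place of the full listing. The trade-off is that columns appearing only in unsampled files would be missed, so the percentage has to be chosen with that risk in mind.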

Comments on this proposal are welcome.

 

 
{code:java}
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
{code}
 

 

> Spark SQL Infer Schema Performance
> ----------------------------------
>
>                 Key: SPARK-32582
>                 URL: https://issues.apache.org/jira/browse/SPARK-32582
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Jarred Li
>            Priority: Major
>
> When schema inference is enabled, it tries to list all the files in the 
> table, however only one of the files is used to read schema information. 
> Performance suffers from listing all the files in the table when the 
> number of partitions is large.
>  
> See the code in 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88]:
> all the files in the table are passed in, however only one file's schema 
> is used to infer the schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
