[
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175428#comment-17175428
]
Jarred Li edited comment on SPARK-32582 at 8/11/20, 10:07 AM:
--------------------------------------------------------------
I think this is a limitation of ORC schema inference. "fileIndex.listFiles"
lists all the files in the table, while only one file is used to get the file
schema ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L96]).
{code:java}
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
{code}
If ORC does not support mergeSchema, we should add a method that reads the
schema from a single file, so that it does not need to list all files. Of course,
this is not a long-term solution.
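A minimal sketch of the short-term idea, in plain Scala with no Spark dependency. The names "FirstSchema", "firstSchema" and "readSchema" are hypothetical and are not existing Spark APIs. It only illustrates that, if the paths are produced lazily, the listing can stop as soon as one file yields a schema:

```scala
// Hypothetical sketch: these names are illustrative, not Spark APIs.
object FirstSchema {
  // Returns the first schema that `readSchema` yields. `paths` is an Iterator,
  // so paths are pulled one at a time and the traversal stops at the first
  // file whose schema can be read.
  def firstSchema[S](paths: Iterator[String])(readSchema: String => Option[S]): Option[S] =
    paths.map(readSchema).collectFirst { case Some(schema) => schema }
}
```

Whether the real file index can produce paths lazily is an assumption here; the sketch does not model partition discovery.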
As for a long-term solution: iterating over all the files for schema inference
is time consuming, even for other file formats such as Parquet. I am wondering
whether we should add a parameter to sample the files for schema inference to
improve performance. We could add one more schema inference mode to
HIVE_CASE_SENSITIVE_INFERENCE to support sampling. Currently, there are 3
modes:
INFER_AND_SAVE, INFER_ONLY, NEVER_INFER
We could add one more mode, "INFER_WITH_SAMPLE". By controlling the sample
percentage, we can control how many files are read for schema inference.
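To make the sampling proposal concrete, here is a rough sketch in plain Scala. "SampleFiles", "sample", "fraction" and "seed" are hypothetical names chosen for illustration; nothing like this exists in Spark today, and "INFER_WITH_SAMPLE" is only the mode proposed above. The sketch assumes the file list is already available and picks a reproducible subset of it:

```scala
import scala.util.Random

// Hypothetical sketch: these names are illustrative, not Spark APIs.
object SampleFiles {
  // Picks about `fraction` of the files (at least one when the input is
  // non-empty). A seeded shuffle keeps the choice reproducible across runs.
  def sample[A](files: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")
    if (files.isEmpty) files
    else {
      val n = math.max(1, math.ceil(files.size * fraction).toInt)
      new Random(seed).shuffle(files).take(n)
    }
  }
}
```

One trade-off to keep in mind: if files in the table have different schemas, a column that only appears in unsampled files would not be seen by the inference.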
Comments on this proposal are welcome.
was (Author: leejianwei):
I think this is a limitation of ORC schema inference. "fileIndex.listFiles"
lists all the files in the table, while "" uses only one file to get the file
schema ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L96]).
If ORC does not support mergeSchema, we should add a method that reads the
schema from a single file, so that it does not need to list all files. Of course,
this is not a long-term solution.
For other file formats, for example Parquet, it is time consuming to iterate
over all the files for schema inference. I am wondering whether we should add a
parameter to sample the files for schema inference to improve performance. We
could add one more schema inference mode to HIVE_CASE_SENSITIVE_INFERENCE to
support sampling. Currently, there are 3 modes:
INFER_AND_SAVE, INFER_ONLY, NEVER_INFER
We could add one more mode, "INFER_WITH_SAMPLE". By controlling the sample
percentage, we can control how many files are read for schema inference.
Comments on this proposal are welcome.
{code:java}
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
{code}
> Spark SQL Infer Schema Performance
> ----------------------------------
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Jarred Li
> Priority: Major
>
> When schema inference is enabled, Spark tries to list all the files in the
> table; however, only one of the files is used to read the schema information.
> Performance suffers from listing all the files in the table when the
> number of partitions is large.
>
> See the code in
> "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]",
> all the files in the table are passed as input; however, only one file's
> schema is used to infer the schema.
>