[
https://issues.apache.org/jira/browse/SPARK-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693670#comment-14693670
]
Olivier Toupin commented on SPARK-6795:
---------------------------------------
This doesn't seem to be fixed. I built the latest branch-1.4, removed our
custom fix for this, and we still hit the issue. On a query against a table
with a lot of files, the driver hangs for a while while it reads partition
footers. The timeline in the UI makes it pretty clear: with our branch there
is almost no empty space, whereas with branch-1.4 there is a 50-second void
in our worst case.
The assumed culprits:
1. readAllFootersInParallelUsingSummaryFiles falls back to reading all footers
when no summary file is available, so most of the time we probably end up
reading every footer even when schema merging is off.
https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L357
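Roughly the kind of guard I would expect there, as a sketch only
(footersToRead, shouldMergeSchemas and leaves are placeholder names, not the
actual newParquet.scala members):

import org.apache.hadoop.fs.FileStatus

// Sketch: only pay for reading every footer when schema merging is requested.
def footersToRead(
    shouldMergeSchemas: Boolean,
    leaves: Seq[FileStatus]): Seq[FileStatus] = {
  if (shouldMergeSchemas) {
    leaves          // merging on: every footer is genuinely needed
  } else {
    leaves.take(1)  // merging off: one footer (or none, given a global schema) suffices
  }
}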
2. Why do we read the schema from the files at all when a metastore schema is
available? Shouldn't it be

val dataSchema0 = maybeDataSchema
  .orElse(maybeMetastoreSchema)
  .orElse(readSchema())

?
https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L370
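Since Option.orElse takes its argument by name, reordering the chain like this
means readSchema() (and the footer reads behind it) is never evaluated when a
metastore schema is present. A self-contained illustration, with stubs that
only mimic the names in newParquet.scala:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val maybeDataSchema: Option[StructType] = None
val maybeMetastoreSchema: Option[StructType] =
  Some(StructType(StructField("id", StringType) :: Nil))

def readSchema(): Option[StructType] =
  // In the real code this reads Parquet footers; with the reordering above it
  // would never run when a metastore schema is available.
  sys.error("should not be called when a metastore schema exists")

val dataSchema0 = maybeDataSchema
  .orElse(maybeMetastoreSchema)
  .orElse(readSchema())
// dataSchema0 == maybeMetastoreSchema and readSchema() was never invoked.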
> Avoid reading Parquet footers on driver side when a global arbitrative
> schema is available
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-6795
> URL: https://issues.apache.org/jira/browse/SPARK-6795
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.1
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Critical
>
> With the help of [Parquet MR PR
> #91|https://github.com/apache/incubator-parquet-mr/pull/91] which will be
> included in the official release of Parquet MR 1.6.0, now it's possible to
> avoid reading footers on the driver side completely when a global
> arbitrative schema is available.
> Currently, the global schema can be either the Hive metastore schema or one
> specified via the data source DDL. All tasks should verify Parquet data files
> and reconcile possible schema conflicts locally against this global schema.
> However, when no global schema is available and schema merging is enabled, we
> still need to read schemas from all data files to infer a valid global schema.
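For reference, a rough sketch of the per-task reconciliation described above;
the reconcile helper and its exact checks are illustrative, not the committed
implementation:

import org.apache.spark.sql.types.StructType

// Each task validates its file's schema against the driver-provided global
// schema instead of the driver reading every footer up front.
def reconcile(globalSchema: StructType, fileSchema: StructType): StructType = {
  val globalByName = globalSchema.fields.map(f => f.name -> f).toMap
  fileSchema.fields.foreach { f =>
    globalByName.get(f.name) match {
      case Some(g) if g.dataType != f.dataType =>
        sys.error(s"Conflicting types for column ${f.name}: ${g.dataType} vs ${f.dataType}")
      case None =>
        sys.error(s"Column ${f.name} is not part of the global schema")
      case _ => // compatible
    }
  }
  // Columns missing from the file are read as nulls, so the global schema wins.
  globalSchema
}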