[ 
https://issues.apache.org/jira/browse/SPARK-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693670#comment-14693670
 ] 

Olivier Toupin commented on SPARK-6795:
---------------------------------------

This doesn't seem to be fixed. I built the latest branch-1.4 and removed our 
custom fix for this, and we still experience this issue. On a query to a table 
with a lot of files, the driver hangs for a while as it reads partitions. 
The timeline in the UI makes this pretty clear: with our branch there is 
almost no empty space, while with branch-1.4 there is a 50s void in our worst 
case.

The suspected culprits:

1. readAllFootersInParallelUsingSummaryFiles defaults to reading all footers 
if no summary file is available. So most of the time we probably read all 
footers even if schema merging is off.

https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L357

2. Why do we read the schema if a metastore schema is available?

Shouldn't it be 

          val dataSchema0 = maybeDataSchema
            .orElse(maybeMetastoreSchema)
            .orElse(readSchema())

?

https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L370
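
To illustrate why the ordering in point 2 matters: Scala's Option.orElse takes 
its argument by name, so the alternative is only evaluated when the receiver 
is empty. With the metastore schema placed before readSchema(), the footer 
read would be skipped whenever a metastore schema exists. Below is a minimal 
sketch of that behaviour. It is not the Spark code: Schema, footerReads, 
resolve and this readSchema() are hypothetical stand-ins, and it assumes 
readSchema() returns an Option.

```scala
object SchemaFallbackSketch {
  // Hypothetical stand-in for a schema type (the real code uses StructType).
  final case class Schema(fields: Seq[String])

  // Counts how often the (pretend) expensive footer read is triggered.
  var footerReads = 0

  // Hypothetical stand-in for readSchema(): pretends to scan all footers.
  def readSchema(): Option[Schema] = {
    footerReads += 1
    Some(Schema(Seq("from_footers")))
  }

  // Proposed ordering: user-specified schema, then metastore schema,
  // and only then the schema read from the data files.
  def resolve(
      maybeDataSchema: Option[Schema],
      maybeMetastoreSchema: Option[Schema]): Option[Schema] =
    maybeDataSchema
      .orElse(maybeMetastoreSchema)
      .orElse(readSchema()) // by-name: evaluated only if both above are None
}
```

With this ordering, resolve(None, Some(metastoreSchema)) returns the metastore 
schema and leaves footerReads untouched. readSchema() only runs when neither 
of the other two schemas is defined.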

> Avoid reading Parquet footers on driver side when an global arbitrative 
> schema is available
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6795
>                 URL: https://issues.apache.org/jira/browse/SPARK-6795
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.1
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Critical
>
> With the help of [Parquet MR PR 
> #91|https://github.com/apache/incubator-parquet-mr/pull/91] which will be 
> included in the official release of Parquet MR 1.6.0, now it's possible to 
> avoid reading footers on the driver side completely when an global 
> arbitrative schema is available.
> Currently, the global schema can be either Hive metastore schema or specified 
> via data sources DDL. All tasks should verify Parquet data files and 
> reconcile possible schema conflicts locally against this global schema.
> However, when no global schema is available and schema merging is enabled, we 
> still need to read schemas from all data files to infer a valid global schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
