Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/5298#issuecomment-89631527
Ah, I'm also considering similar optimizations for Spark 1.4 :)
The tricky part here is that, when scanning the Parquet table, Spark needs
to call `ParquetInputFormat.getSplits` to compute (Spark) partition
information. This `getSplits` call can be super expensive because it needs to read
the footers of all Parquet part-files to compute the Parquet splits. That's why
`ParquetRelation2` caches those footers at the very beginning and injects them
into an extended Parquet input format. With all these footers cached,
`ParquetRelation2.readSchema()` is actually quite lightweight, so the real
bottleneck is reading all those footers.
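Just to make the cost concrete, driver-side footer reading boils down to something like the rough sketch below (hedged: this is not the actual `ParquetRelation2` code, the helper name and the flat `listStatus` are made up, and the `parquet.hadoop` package name is the pre-1.7 one):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Hypothetical helper, not the real ParquetRelation2 code: reads the footer of
// every part-file under the table path. Each footer read opens the file and
// seeks to its tail, so the cost grows with the number of part-files.
def readAllFooters(conf: Configuration, tablePath: Path) = {
  val fs = tablePath.getFileSystem(conf)
  val partFiles = fs.listStatus(tablePath).filter(_.getPath.getName.endsWith(".parquet"))
  ParquetFileReader.readAllFootersInParallel(conf, partFiles.toList.asJava)
}
```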
Fortunately, Parquet itself is also working on avoiding driver-side footer reads
entirely (see https://github.com/apache/incubator-parquet-mr/pull/91 and
https://github.com/apache/incubator-parquet-mr/pull/45). After upgrading to
Parquet 1.6, which is expected to be released next week, we can do this
properly for better performance.
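If I remember correctly, the 1.6 behavior is gated behind a task-side-metadata flag, so on the Spark side enabling it might look roughly like this (hedged, I haven't verified the final config key):

```scala
import org.apache.hadoop.conf.Configuration

// With task-side metadata enabled, getSplits should only need file lengths and
// block locations on the driver; each task reads the footer of its own
// part-file instead.
val hadoopConf = new Configuration()
hadoopConf.setBoolean("parquet.task.side.metadata", true)
```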
So ideally, we don't read footers on the driver side at all, and when we have a
central arbitrative schema at hand, either from the metastore or from data source DDL,
we don't do schema merging on the driver side either. I haven't had time to walk
through all the related Parquet code paths and PRs yet, so the above statements may
be inaccurate. Please correct me if you find any mistakes.
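To make that last idea concrete, the schema resolution could be shaped like the following (purely hypothetical helper, just to illustrate the fallback order):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: prefer the schema already fixed by the metastore or
// data source DDL; only fall back to merging per-file schemas (which requires
// reading all part-file footers) when no such central schema exists.
def resolveSchema(
    metastoreSchema: Option[StructType],
    mergeFooterSchemas: () => StructType): StructType = {
  metastoreSchema.getOrElse(mergeFooterSchemas())
}
```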