Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/5298#issuecomment-92414575
@saucam Right now I feel kinda hesitant to have this. As explained in my
previous comment, the major bottleneck in Parquet metadata handling is reading
footers. Without eliminating that, moving schema merging to the task side
doesn't bring performance benefits (although I haven't run any benchmarks for
this PR yet). Plus, there is a risk of introducing regressions.
However, this PR is still very valuable as it proves this approach is
doable. Eventually, we would like to have this after upgrading Parquet to 1.6.0
and adding the ability to avoid reading footers on the driver side whenever a
global arbitrative schema is available. I've opened [SPARK-6795][1] to track
this issue. I will probably start working on SPARK-6795 later this month. Would
you mind if I revisit this PR at that time?
[1]: https://issues.apache.org/jira/browse/SPARK-6795