GitHub user oliviertoupin opened a pull request:
https://github.com/apache/spark/pull/7049
[SPARK-6910] [WiP] Reduce number of operations to the cluster.
Here is my workaround for SPARK-6910.
It doesn't prune early. It takes a different approach. Instead our
investigation showed that during a query w/ hivecontext on a parquet
partitioned table w/ many partitions and many files per partitions (totalling
~40K files), the metastore was poked a lot, but more importantly the datanodes
were solicited A LOT. Turns out when doing a query spark would read the footers
of ALL parquet files in order to build the schema. This PR read to footers lazy
only if needed.
Also:
* The metastore already contains the schema, so we use it when available
(always when using metastore tables?) to avoid reading the footers.
* We don't need to poke that much files in order to figure out the schema,
unless we do schema merging, but this in the normal path (w/o schema merging),
however, there is a bug or at least a big caveat in parquet-mr. Spark when
using `readAllFootersInParallelUsingSummaryFiles` will end up reading ALL the
footers for that tables. This because when there is no summary, parquet-mr
revert to `readAllFootersInParallel`
I'm not sure if it's mergeable as-is, this PR fit our requirement and we
use in our build, but maybe not those of the whole community.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/oliviertoupin/spark spark-6910
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7049.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7049
----
commit 594ec0299fd4ce7fd4d45e8d6fa7c7ba74e3cb39
Author: Olivier Toupin <[email protected]>
Date: 2015-06-19T19:51:41Z
More efficient way of poking the namenode.
commit 717c815f2f76bf0a289a507738e818f510f94fbe
Author: Olivier Toupin <[email protected]>
Date: 2015-06-26T14:26:01Z
Lazy footers discovery + Trust metastore schema. This reduce the number of
poked files drastically on when a table have a lot of files and/or partitions
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]