GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/7396
[SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery
This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation`
partition discovery. The acceleration is done by the following means:
- Turning off schema merging by default
Schema merging is not the most common case, but requires reading footers
of all Parquet part-files and can be very slow.
- Avoiding `FileSystem.globStatus()` call when possible
`FileSystem.globStatus()` may issue multiple synchronous RPC calls, and
can be very slow (esp. on S3). This PR adds
`SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the
path contains glob-pattern-specific characters (`{}[]*?\`).
- Listing leaf files in parallel when the number of input paths exceeds a
threshold
Listing leaf files is required by partition discovery. Currently it is
done on the driver side, and can be slow when there are lots of (nested)
directories, since each `FileSystem.listStatus()` call issues an RPC. In this
PR, we list leaf files in a BFS style, and resort to a Spark job once we find
that the number of directories to be listed exceeds a threshold.
The threshold is controlled by `SQLConf` option
`spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32.
- Discovering Parquet schema in parallel
Currently, schema merging is also done on the driver side, and needs to read
footers of all part-files. This PR uses a Spark job to do schema merging.
Together with task-side metadata reading in Parquet 1.7.0, we never read any
footers on the driver side now.
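The glob check and the BFS-style listing described above can be sketched roughly as follows. This is a minimal illustration in plain Scala, not the code in this PR: the object names, `isGlobPath`, and the `expand`, `listStatus`, and `listInParallel` parameters are hypothetical stand-ins (for `FileSystem.globStatus()`, `FileSystem.listStatus()`, and the Spark job respectively), so that the sketch runs without Hadoop or Spark on the classpath.

```scala
// Hypothetical sketch; names and signatures are assumptions, not Spark's API.
object GlobCheck {
  // The characters the description lists as glob-pattern specific: {}[]*?\
  private val globChars: Set[Char] = "{}[]*?\\".toSet

  /** True if the path string contains any glob-pattern character. */
  def isGlobPath(path: String): Boolean = path.exists(globChars.contains)

  /**
   * Calls `expand` (a stand-in for an RPC-issuing call such as
   * FileSystem.globStatus()) only when the path looks like a glob;
   * otherwise returns the path unchanged without calling it.
   */
  def globPathIfNecessary(path: String)(expand: String => Seq[String]): Seq[String] =
    if (isGlobPath(path)) expand(path) else Seq(path)
}

object LeafFileListing {
  /**
   * Lists leaf files level by level (BFS). `listStatus` stands in for
   * FileSystem.listStatus() and returns (childPath, isDirectory) pairs.
   * Once the number of directories to list exceeds `threshold`, the
   * remaining work is handed to `listInParallel`, a stand-in for a Spark job.
   */
  def listLeafFiles(
      roots: Seq[String],
      threshold: Int,
      listStatus: String => Seq[(String, Boolean)],
      listInParallel: Seq[String] => Seq[String]): Seq[String] = {
    @annotation.tailrec
    def loop(dirs: Seq[String], files: Vector[String]): Seq[String] =
      if (dirs.isEmpty) files
      else if (dirs.length > threshold) files ++ listInParallel(dirs)
      else {
        // One simulated RPC per directory at this level.
        val children = dirs.flatMap(listStatus)
        val (subDirs, leaves) = children.partition(_._2)
        loop(subDirs.map(_._1), files ++ leaves.map(_._1))
      }
    loop(roots, Vector.empty)
  }
}
```

With a small in-memory directory tree and a threshold of 32 (the default mentioned above), the listing stays on the sequential path; lowering the threshold below the number of input paths sends the work to `listInParallel` instead.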
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark accel-parquet
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7396.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7396
----
commit e2d07af20d1409eaa5d49235e3255e0e99f6502c
Author: Cheng Lian <[email protected]>
Date: 2015-07-01T23:32:44Z
Moves schema merging to executor side
Removes some dead code
Parallelizes input paths listing
----