GitHub user ajacques opened a pull request:
https://github.com/apache/spark/pull/21889
[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt)
(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)
**This is a restart of apache/spark#21320. Most of the discussion has taken
place over there. I've only taken it as a based to make stylistic changes based
on the code review to help move things along.**
Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this
PR to collect any feedback. This can be closed if you prefer to continue to the
work in the original PR.
### What changes were proposed in this pull request?
One of the hallmarks of a column-oriented data storage format is the
ability to read data from a subset of columns, efficiently skipping reads from
other columns. Spark has long had support for pruning unneeded top-level schema
fields from the scan of a parquet file. For example, consider a table,
contacts, backed by parquet with the following Spark SQL schema:
```
root
|-- name: struct
| |-- first: string
| |-- last: string
|-- address: string
```
Parquet stores this table's data in three physical columns: name.first,
name.last and address. To answer the query
`select address from contacts`
Spark will read only from the address column of parquet data. However, to
answer the query
`select name.first from contacts`
Spark will read name.first and name.last from parquet.
This PR modifies Spark SQL to support a finer-grain of schema pruning. With
this patch, Spark reads only the name.first column to answer the previous query.
### Implementation
There are two main components of this patch. First, there is a
ParquetSchemaPruning optimizer rule for gathering the required schema fields of
a PhysicalOperation over a parquet file, constructing a new schema based on
those required fields and rewriting the plan in terms of that pruned schema.
The pruned schema fields are pushed down to the parquet requested read schema.
ParquetSchemaPruning uses a new ProjectionOverSchema extractor for rewriting a
catalyst expression in terms of a pruned schema.
Second, the ParquetRowConverter has been patched to ensure the ordinals of
the parquet columns read are correct for the pruned schema. ParquetReadSupport
has been patched to address a compatibility mismatch between Spark's built in
vectorized reader and the parquet-mr library's reader.
### Limitation
Among the complex Spark SQL data types, this patch supports parquet column
pruning of nested sequences of struct fields only.
### How was this patch tested?
Care has been taken to ensure correctness and prevent regressions. A more
advanced version of this patch incorporating optimizations for rewriting
queries involving aggregations and joins has been running on a production Spark
cluster at VideoAmp for several years. In that time, one bug was found and
fixed early on, and we added a regression test for that bug.
We forward-ported this patch to Spark master in June 2016 and have been
running this patch against Spark 2.x branches on ad-hoc clusters since then.
### Known Issues
Highlighted in
https://github.com/apache/spark/pull/21320#issuecomment-408271470
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ajacques/apache-spark
spark-4502-parquet_column_pruning-foundation
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21889.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21889
----
commit 86c572a26918e6380762508d1d868fd5ea231ed1
Author: Michael Allman <michael@...>
Date: 2016-06-24T17:21:24Z
[SPARK-4502][SQL] Parquet nested column pruning
commit 1ffbc6c8f8eaa27b50048e2124afdd2ff9e1e2b4
Author: Michael Allman <msa@...>
Date: 2018-06-04T09:59:34Z
Refactor SelectedFieldSuite to make its tests simpler and more
comprehensible
commit 97cd30b221fe78a4ea61c6c24be370fc2dfdd498
Author: Michael Allman <msa@...>
Date: 2018-06-04T10:02:28Z
Remove test "select function over nested data" of unknown origin and
purpose
commit c27e87924ed788a4253d077691401d0852306406
Author: Michael Allman <msa@...>
Date: 2018-06-04T10:45:28Z
Improve readability of ParquetSchemaPruning and
ParquetSchemaPruningSuite. Add test to exercise whether the
requested root fields in a query exclude any attributes
commit 2fc1173499e1fc4ca4205db80e78ca6f110fb5dd
Author: Michael Allman <msa@...>
Date: 2018-06-04T12:04:23Z
Don't handle non-data-field partition column names specially when
constructing the new projections in ParquetSchemaPruning. These
column projections are left unchanged by the transformDown function when
the projectionOverSchema pattern does not match
commit 40dae02ec31e4817d6afe9e738a20e3c6e6ddd03
Author: Michael Allman <msa@...>
Date: 2018-06-04T12:07:35Z
Add test coverage for ParquetSchemaPruning for partitioned tables whose
data schema includes the partition column
commit f4e11770ffe006baaec136ea77bc2cba38c8aad2
Author: Michael Allman <msa@...>
Date: 2018-06-12T00:22:40Z
Remove the ColumnarFileFormat type to put it in another PR
commit 9749d8cf644a35f28621b12920be6de912b76763
Author: Michael Allman <msa@...>
Date: 2018-06-12T02:34:19Z
Add test coverage for the enhancements to "is not null" constraint
inference for complex type extractors
commit 3421041b2da13c269f8811e78719e2d8eaaf6628
Author: Michael Allman <msa@...>
Date: 2018-06-24T19:49:33Z
Revert changes to QueryPlanConstraints.scala and
basicPhysicalOperators.scala. Remove related unit tests. This change removes
support for schema pruning on filters involving attributes of complex type
commit 4ddd78a159300cbb19afb7d12c63279f17f9a32b
Author: Michael Allman <msa@...>
Date: 2018-06-24T19:55:49Z
Revert a whitespace change in DataSourceScanExec.scala
commit 9fe4208091950e6f7af4b4ebaca6de0b720015f9
Author: Michael Allman <msa@...>
Date: 2018-07-21T09:41:11Z
Remove modifications to ParquetFileFormat.scala and
ParquetReadSupport.scala, marking broken tests in
ParquetSchemaPruningSuite.scala as "ignored"
commit 9a1f80817abfa03430594ef3a9447f030f885bd5
Author: Michael Allman <msa@...>
Date: 2018-07-21T21:47:45Z
PR review: simplify some syntax and add a code doc
commit 7ee616076f93d6cfd55b6646314f3d4a6d1530d3
Author: Adam Jacques <adam@...>
Date: 2018-07-26T01:14:35Z
Refactor code slightly based on code review comments
commit e1836d13cfd285a385e8b65f66adb7574d121429
Author: Adam Jacques <adam@...>
Date: 2018-07-26T01:49:07Z
Minor refactorings for code review
commit cd6ae2c7457b646df73fea04c33d38581a63952f
Author: Adam Jacques <adam@...>
Date: 2018-07-26T05:11:33Z
Refactorings for code reviews
commit 4b847acd8279d8c40115638faac30a4bc1736307
Author: Adam Jacques <adam@...>
Date: 2018-07-27T00:27:22Z
Mark as private
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]