[GitHub] [spark] sunchao opened a new pull request, #36068: [SPARK-37377][SQL][3.3] Initial implementation of Storage-Partitioned Join

GitBox Mon, 04 Apr 2022 19:45:48 -0700


sunchao opened a new pull request, #36068:
URL: https://github.com/apache/spark/pull/36068

This is a backport of #35657 to `branch-3.3`

### What changes were proposed in this pull request?

This PR introduces the initial implementation of Storage-Partitioned Join
([SPIP](https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE)).

Changes:
- `org.apache.spark.sql.connector.read.partitioning.Partitioning` is currently
very limited (as mentioned in the SPIP) and cannot be extended to handle
join cases. This PR completely replaces it, following the catalyst `Partitioning`
interface, and adds two concrete sub-classes: `KeyGroupedPartitioning` and
`UnknownPartitioning`. This allows a V2 data source to report its
partition transform expressions to Spark via the `SupportsReportPartitioning` interface.
- with the above change,
`org.apache.spark.sql.connector.read.partitioning.Distribution` and
`org.apache.spark.sql.connector.read.partitioning.ClusteredDistribution` are
now replaced by classes with the same names in the
`org.apache.spark.sql.connector.distributions` package. Therefore, this PR
marks the former two as deprecated.
- `DataSourcePartitioning` used to be in
`org.apache.spark.sql.execution.datasources.v2`. This moves it into package
`org.apache.spark.sql.catalyst.plans.physical` and renames it to
`KeyGroupedPartitioning`, so that it can be extended for more non-V2 use cases,
such as Hive bucketing. In addition, it is also changed to accommodate the
Storage-Partitioned Join feature.
- a new expression type, `TransformExpression`, is introduced to bind
syntactic partition transforms to their semantic meaning, represented by a V2
function. This expression is un-evaluable for now, and is used later in
`EnsureRequirements` to check whether join children are compatible with each
other.
- a new optimizer rule, `V2ScanPartitioning`, is added to recognize `Scan`s
that implement `SupportsReportPartitioning`. If they do, this rule converts V2
partition transform expressions into their counterparts in catalyst, and
annotates `DataSourceV2ScanRelation` with the result. These are later propagated
into `DataSourceV2ScanExecBase`.
- changes are made in `DataSourceV2ScanExecBase` to create
`KeyGroupedPartitioning` for a scan if 1) the scan is annotated with catalyst
partition transform expressions, and 2) all input splits implement
`HasPartitionKey`.
- a new config, `spark.sql.sources.v2.bucketing.enabled`, is introduced to
turn the behavior on or off. It is false by default.
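
The second condition for creating `KeyGroupedPartitioning` (every input split
must report a partition key) can be illustrated with a small stand-alone model.
This is only a sketch: `KeyGroupingSketch`, `Split`, and `groupByKey` are
hypothetical names invented for illustration and are not Spark classes, and the
real logic in `DataSourceV2ScanExecBase` may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Toy model only: these types are hypothetical stand-ins, not Spark classes.
class KeyGroupingSketch {

    /** An input split; partitionKey is null when the split reports no key. */
    static final class Split {
        final String name;
        final List<Object> partitionKey;

        Split(String name, List<Object> partitionKey) {
            this.name = name;
            this.partitionKey = partitionKey;
        }
    }

    /**
     * Groups split names by partition key. Returns Optional.empty() if any
     * split lacks a key, since key-grouped partitioning is only usable when
     * every split reports its key.
     */
    static Optional<Map<List<Object>, List<String>>> groupByKey(List<Split> splits) {
        Map<List<Object>, List<String>> groups = new LinkedHashMap<>();
        for (Split s : splits) {
            if (s.partitionKey == null) {
                return Optional.empty();
            }
            groups.computeIfAbsent(s.partitionKey, k -> new ArrayList<>()).add(s.name);
        }
        return Optional.of(groups);
    }

    public static void main(String[] args) {
        List<Split> keyed = Arrays.asList(
            new Split("s0", Arrays.<Object>asList(0)),
            new Split("s1", Arrays.<Object>asList(1)),
            new Split("s2", Arrays.<Object>asList(0)));
        // Splits s0 and s2 share key [0], so they end up in the same group.
        System.out.println(groupByKey(keyed));
    }
}
```

In this model a single split without a key makes the whole scan fall back to
"no grouping", which is the conservative choice: a partial grouping could not
be relied on to co-locate matching keys.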

### Why are the changes needed?

Spark currently supports bucketing in DataSource V1, but not in V2. This is
the first step to support bucket join, and its general form, storage-partitioned
join, for V2 data sources. In addition, the work here can potentially be used to
support Hive bucketing as well. Please check the SPIP for details.

### Does this PR introduce _any_ user-facing change?

With the changes, a user can now:
- have V2 data sources report distribution and ordering to Spark on the read
path
- rely on Spark to recognize the distribution property and eliminate shuffles in
join/aggregate/window, etc., when the source distribution matches the
distribution these operators require
- use the new config `spark.sql.sources.v2.bucketing.enabled` to turn the above
behavior on or off
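
As a rough illustration of when a shuffle can be skipped, the sketch below
models two join sides that each report a bucket transform. It is a deliberately
simplified, hypothetical model: `ShuffleDecisionSketch`, `Reported`, and
`canSkipShuffle` are invented names, and the conditions shown are assumptions
for illustration; the actual compatibility check in `EnsureRequirements` is more
involved.

```java
// Toy model only: hypothetical names, not Spark's planner code.
class ShuffleDecisionSketch {

    /** What one side of a join reports: a transform over a single column. */
    static final class Reported {
        final String transform;  // e.g. "bucket"
        final int numBuckets;
        final String column;

        Reported(String transform, int numBuckets, String column) {
            this.transform = transform;
            this.numBuckets = numBuckets;
            this.column = column;
        }
    }

    /**
     * Returns true when, in this simplified model, both sides are already
     * laid out compatibly on their join keys, so no shuffle is needed.
     */
    static boolean canSkipShuffle(boolean bucketingEnabled,
                                  Reported left, String leftJoinKey,
                                  Reported right, String rightJoinKey) {
        if (!bucketingEnabled || left == null || right == null) {
            return false; // feature off, or one side reports nothing
        }
        boolean sameTransform = left.transform.equals(right.transform)
            && left.numBuckets == right.numBuckets;
        boolean keyedOnJoinColumns = left.column.equals(leftJoinKey)
            && right.column.equals(rightJoinKey);
        return sameTransform && keyedOnJoinColumns;
    }

    public static void main(String[] args) {
        Reported orders = new Reported("bucket", 8, "customer_id");
        Reported customers = new Reported("bucket", 8, "id");
        System.out.println(canSkipShuffle(true, orders, "customer_id", customers, "id"));
        System.out.println(canSkipShuffle(false, orders, "customer_id", customers, "id"));
    }
}
```

The `bucketingEnabled` flag stands in for
`spark.sql.sources.v2.bucketing.enabled`: with it off, the model always falls
back to shuffling, matching the default of false described above.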

### How was this patch tested?

- Added a new test suite, `KeyGroupedPartitioningSuite`, that covers end-to-end
tests on the new feature
- Extended `EnsureRequirementsSuite` to cover `KeyGroupedPartitioning` (formerly `DataSourcePartitioning`)
- Extended some existing test classes, such as `InMemoryTable`, to cover
the changes

