szehon-ho opened a new pull request, #42306:
URL: https://github.com/apache/spark/pull/42306
### What changes were proposed in this pull request?
- Add a new conf, `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled`.
- Change the key-compatibility checks in EnsureRequirements: remove the checks requiring all partition keys to be present in the join keys, so that isKeyCompatible can be true in this case.
- Change BatchScanExec/DataSourceV2Relation to group splits by join keys when they differ from the partition keys (previously splits were grouped only by partition values).
- Implement partiallyClustered skew handling:
  - Group only the replicate side (now by join key as well).
  - Add a final sort of the partitions by join key, because grouping the non-replicate side leaves the partition ordering out of order.
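The split-grouping change above can be illustrated with a small, Spark-free sketch (the helper below is hypothetical and not Spark's actual implementation): when the join keys are a subset of the partition keys, splits whose partition values agree on the join-key positions are merged into one group, and the groups are ordered by join key, mirroring the extra sort described above.

```python
from itertools import groupby

def group_splits_by_join_keys(splits, join_key_positions):
    """Group (partition_values, split) pairs by the projection of the
    partition values onto the join-key positions.

    Hypothetical helper for illustration only, not a Spark API.
    """
    def project(partition_values):
        return tuple(partition_values[i] for i in join_key_positions)

    # Sort first so groupby sees equal projected keys adjacently; this also
    # mirrors the final sort of partitions by join key described above.
    ordered = sorted(splits, key=lambda s: project(s[0]))
    return {
        key: [split for _, split in group]
        for key, group in groupby(ordered, key=lambda s: project(s[0]))
    }

# Tables partitioned by (store, day); the join condition uses only store,
# i.e. position 0 of the partition key.
splits = [
    (("store1", "2023-01-01"), "file-a"),
    (("store1", "2023-01-02"), "file-b"),
    (("store2", "2023-01-01"), "file-c"),
]
grouped = group_splits_by_join_keys(splits, join_key_positions=[0])
# All of store1's splits now fall into a single group keyed by ("store1",).
```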
### Why are the changes needed?
- Support Storage Partitioned Join in cases where the join condition contains only a subset of the partition keys rather than all of them.
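As a usage sketch (the catalog, table, and column names are made up; only the conf names are real, the new one coming from this PR), a user would enable the new conf alongside the existing SPJ conf and join on a subset of the partition keys:

```scala
// Hypothetical session setup; table and column names are illustrative only.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
spark.conf.set(
  "spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled",
  "true")

// Both tables are partitioned by (store_id, day), but the join condition
// uses only store_id. With the new conf, SPJ can group splits by store_id
// on both sides instead of falling back to a shuffle.
val joined = spark.table("catalog.db.sales")
  .join(spark.table("catalog.db.returns"), Seq("store_id"))
```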
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Added tests in KeyGroupedPartitioningSuite.
- Found two problems, to be addressed in separate PRs:
  - https://github.com/apache/spark/pull/37886 introduced a change that requires selecting all join keys; otherwise the DSV2 scan does not report KeyGroupedPartitioning and SPJ is not triggered. Need to see how this can be relaxed.
  - https://issues.apache.org/jira/browse/SPARK-44641 was found while testing this change. This PR refactors some of that code to add group-by-join-key but does not change the underlying logic, so the issue continues to exist. Hopefully it will also get fixed another way.