GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/21587
[SPARK-24588][SS] streaming join should require HashClusteredPartitioning
from children
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/19080 we simplified the
distribution/partitioning framework and made all the join-like operators
require `HashClusteredPartitioning` from their children. Unfortunately, the
streaming join operator was missed.
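The diff itself is not included in this message. As a rough, hedged sketch of the
shape of the change (assuming the `HashClusteredDistribution` class introduced by
#19080, which the title refers to as `HashClusteredPartitioning`, and the
`leftKeys`/`rightKeys` join-key fields of `StreamingSymmetricHashJoinExec`), the
operator would declare the stricter requirement on both children roughly like this:

```scala
// Sketch only, not the verbatim patch from this PR. Intended to be read inside
// StreamingSymmetricHashJoinExec; it is not a standalone compilable file.
import org.apache.spark.sql.catalyst.plans.physical.{Distribution, HashClusteredDistribution}

// Ask for hash partitioning on exactly the join keys from both children,
// matching what the other join-like operators do after #19080.
override def requiredChildDistribution: Seq[Distribution] =
  HashClusteredDistribution(leftKeys) :: HashClusteredDistribution(rightKeys) :: Nil
```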
This is not a real issue in practice. Only two partitionings can satisfy
`ClusteredDistribution`: hash partitioning and range partitioning. Streaming
does not support sort, so a streaming join will never end up with one side
hash-partitioned and the other range-partitioned, which is the scenario that
would produce wrong results. The streaming source API also does not support
reporting range partitioning yet. Still, we should fix this potential bug.
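To make the co-partitioning argument concrete, here is a minimal, self-contained
Scala sketch of the satisfaction rules described above. The types are deliberately
simplified stand-ins (keys as strings rather than Catalyst expressions) and are
not Spark's actual `Distribution`/`Partitioning` classes:

```scala
// Illustrative model only: hypothetical simplified types, not Spark's internals.
object SatisfactionSketch {
  sealed trait Partitioning
  final case class HashPartitioning(keys: Seq[String], numPartitions: Int) extends Partitioning
  final case class RangePartitioning(keys: Seq[String], numPartitions: Int) extends Partitioning

  sealed trait Distribution {
    def isSatisfiedBy(p: Partitioning): Boolean
  }

  // Clustered: rows sharing the clustering keys must be co-located, which both
  // hash and range partitioning on (a subset of) those keys can provide.
  final case class Clustered(keys: Seq[String]) extends Distribution {
    def isSatisfiedBy(p: Partitioning): Boolean = p match {
      case HashPartitioning(ks, _)  => ks.nonEmpty && ks.forall(keys.contains)
      case RangePartitioning(ks, _) => ks.nonEmpty && ks.forall(keys.contains)
    }
  }

  // HashClustered: only hash partitioning on exactly these keys qualifies, so two
  // children that both satisfy it are guaranteed to be partitioned the same way.
  final case class HashClustered(keys: Seq[String]) extends Distribution {
    def isSatisfiedBy(p: Partitioning): Boolean = p match {
      case HashPartitioning(ks, _) => ks == keys
      case _                       => false
    }
  }

  def main(args: Array[String]): Unit = {
    val clusteredReq     = Clustered(Seq("k"))
    val hashClusteredReq = HashClustered(Seq("k"))

    val hashSide  = HashPartitioning(Seq("k"), 10)
    val rangeSide = RangePartitioning(Seq("k"), 10)

    // Both sides individually satisfy the looser clustered requirement, even
    // though the same key may land in different partitions on each side.
    println(clusteredReq.isSatisfiedBy(hashSide))   // true
    println(clusteredReq.isSatisfiedBy(rangeSide))  // true

    // Only the hash-partitioned side satisfies the stricter requirement, so a
    // planner would insert a shuffle for the range-partitioned side.
    println(hashClusteredReq.isSatisfiedBy(hashSide))   // true
    println(hashClusteredReq.isSatisfiedBy(rangeSide))  // false
  }
}
```

The last two checks are the point of the stricter requirement: once both children
must satisfy a hash-clustered requirement on the join keys, they are partitioned
identically, so matching rows always land in the same task.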
## How was this patch tested?
N/A
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark join
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21587.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21587
----
commit 1f3d9df26bc543802b02b9f5b20178f6255752dd
Author: Wenchen Fan <wenchen@...>
Date: 2018-06-18T23:55:47Z
StreamingSymmetricHashJoinExec should require HashClusteredPartitioning
from children
----