GitHub user zecevicp commented on the issue:
https://github.com/apache/spark/pull/21109
There's no design doc. I didn't feel the change was big enough to warrant
one.
1. Currently there is no spill-over to disk. If the range is too big, users
can switch the optimization off and fall back to the much slower plain
sort-merge join (SMJ) version, which avoids an OOM. Implementing spill-over
doesn't look trivial, because the new code path is more dynamic than the
original version, and it's not yet clear how to do it. Maybe we can add that
in the future, once we figure it out?
2. This whole optimization doesn't apply when the join has no equality
condition.
3. I didn't understand the case you're describing. Can you elaborate,
please? Either way, only one pass through the data is needed, whether or not
the data is skewed.
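
To make the single-pass claim in point 3 concrete, here is a minimal Python sketch of a sliding-window range join over two sorted inputs. It is only an illustration under stated assumptions, not the code from this PR: the function name `range_join_sorted`, its parameters, and the use of plain numbers in place of rows are all hypothetical, and it assumes both inputs already belong to the same equality-key group and are sorted ascending.

```python
from collections import deque


def range_join_sorted(left, right, lower, upper):
    """Emit (l, r) pairs where l + lower <= r <= l + upper.

    Hypothetical helper for illustration only. Assumes `left` and `right`
    are sorted ascending and that lower <= upper.
    """
    window = deque()  # right-side values currently inside the range window
    out = []
    right_iter = iter(right)
    pending = next(right_iter, None)  # next right value not yet buffered

    for l in left:
        # Advance the right side: buffer every value up to the upper bound.
        while pending is not None and pending <= l + upper:
            window.append(pending)
            pending = next(right_iter, None)
        # Evict values that fell below the lower bound. Because `left` is
        # sorted, an evicted value can never match a later left value.
        while window and window[0] < l + lower:
            window.popleft()
        out.extend((l, r) for r in window)
    return out


if __name__ == "__main__":
    pairs = range_join_sorted([1, 5, 10], [0, 2, 4, 6, 9, 11, 20], -1, 1)
    print(pairs)
```

Each right-side value is buffered once and evicted once, so both inputs are read in a single pass regardless of how the values are distributed. The sketch also shows why point 1 matters: `window` lives entirely in memory, so a very wide range makes it grow with the number of right-side values that fall inside it.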