GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/7456
[SPARK-9023] [SQL] [WIP] Efficiency improvements for UnsafeRows in Exchange
This work-in-progress pull request aims to improve the performance of SQL's
Exchange operator when shuffling UnsafeRows. It also makes several general
efficiency improvements to Exchange.
Key changes:
- When performing hash partitioning, the old Exchange projected the
partitioning columns into a new row then passed a `(partitioningColumRow:
InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient
because it ends up redundantly serializing the partitioning columns only to
immediately discard them after the shuffle. After this patch's changes,
Exchange now shuffles `(partitionId: Int, row: InternalRow)` pairs. This still
isn't optimal, since we're still shuffling extra data that we don't need, but
it's significantly more efficient than the old implementation; in the future,
we may be able to further optimize this once we implement a new shuffle write
interface that accepts non-key-value-pair inputs.
- Exchange's `compute()` method has been significantly simplified; the new
code has less duplication and thus is easier to understand.
- SparkPlan now has an `outputsUnsafeRows` method which allows plans to
signal whether they will produce UnsafeRows.
- When the Exchange's input operator produces UnsafeRows, Exchange will use
a specialized `UnsafeRowSerializer` to serialize these rows. This serializer
is significantly more efficient since it simply copies the UnsafeRow's
underlying bytes. Note that this approach does not work for UnsafeRows that
use the ObjectPool mechanism; I did not add support for this because we are
planning to remove ObjectPool in the next few weeks.
TODOs:
- [ ] Add unit tests for the new UnsafeRow patch in Exchange.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark unsafe-exchange
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7456.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7456
----
commit 3526868a1dfa550a9375b171d96c4005df342692
Author: Josh Rosen <[email protected]>
Date: 2015-07-16T20:05:13Z
Iniitial cut at removing shuffle on KV pairs
commit 3ca8515da53f92cae09e5f4be5fd819af062a9cf
Author: Josh Rosen <[email protected]>
Date: 2015-07-16T20:47:51Z
Big code simplification in Exchange
commit 0f2ac867ef39abfb9521597956bf78ff8b3fadbd
Author: Josh Rosen <[email protected]>
Date: 2015-07-16T21:05:29Z
Import ordering
commit cbea80bb1229312ed9741ba98204740355fc1434
Author: Josh Rosen <[email protected]>
Date: 2015-07-17T00:37:28Z
Add UnsafeRowSerializer
commit 7876f31c5ae584b216f56d3fb7fc73379a011ea3
Author: Josh Rosen <[email protected]>
Date: 2015-07-17T00:55:32Z
Merge remote-tracking branch 'origin/master' into unsafe-shuffle
commit 035af21be95fb2d22e020d2c30319940fd05ae26
Author: Josh Rosen <[email protected]>
Date: 2015-07-17T01:16:52Z
Add logic for choosing when to use UnsafeRowSerializer
commit dd9c66d20abbea4c628c223f5a491ae661c90c93
Author: Josh Rosen <[email protected]>
Date: 2015-07-17T01:28:42Z
Fix for copying logic
commit 8dd3ff21016bf9b5217626c5b9723f43de055def
Author: Josh Rosen <[email protected]>
Date: 2015-07-17T01:37:49Z
Exchange outputs UnsafeRows when its child outputs them
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]