[GitHub] spark pull request: [SPARK-9023] [SQL] [WIP] Efficiency improvemen...

JoshRosen Thu, 16 Jul 2015 18:38:51 -0700

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/7456


    [SPARK-9023] [SQL] [WIP] Efficiency improvements for UnsafeRows in Exchange

    This work-in-progress pull request aims to improve the performance of SQL's 
Exchange operator when shuffling UnsafeRows.  It also makes several general 
efficiency improvements to Exchange.
    
    Key changes:
    
    - When performing hash partitioning, the old Exchange projected the 
partitioning columns into a new row then passed a `(partitioningColumRow: 
InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient 
because it ends up redundantly serializing the partitioning columns only to 
immediately discard them after the shuffle.  After this patch's changes, 
Exchange now shuffles `(partitionId: Int, row: InternalRow)` pairs.  This still 
isn't optimal, since we're still shuffling extra data that we don't need, but 
it's significantly more efficient than the old implementation; in the future, 
we may be able to further optimize this once we implement a new shuffle write 
interface that accepts non-key-value-pair inputs.
    - Exchange's `compute()` method has been significantly simplified; the new 
code has less duplication and thus is easier to understand.
    - SparkPlan now has an `outputsUnsafeRows` method which allows plans to 
signal whether they will produce UnsafeRows.
    - When the Exchange's input operator produces UnsafeRows, Exchange will use 
a specialized `UnsafeRowSerializer` to serialize these rows.  This serializer 
is significantly more efficient since it simply copies the UnsafeRow's 
underlying bytes.  Note that this approach does not work for UnsafeRows that 
use the ObjectPool mechanism; I did not add support for this because we are 
planning to remove ObjectPool in the next few weeks.
    
    TODOs:
    
    - [ ] Add unit tests for the new UnsafeRow patch in Exchange.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark unsafe-exchange

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7456.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7456
    
----
commit 3526868a1dfa550a9375b171d96c4005df342692
Author: Josh Rosen <[email protected]>
Date:   2015-07-16T20:05:13Z

    Iniitial cut at removing shuffle on KV pairs

commit 3ca8515da53f92cae09e5f4be5fd819af062a9cf
Author: Josh Rosen <[email protected]>
Date:   2015-07-16T20:47:51Z

    Big code simplification in Exchange

commit 0f2ac867ef39abfb9521597956bf78ff8b3fadbd
Author: Josh Rosen <[email protected]>
Date:   2015-07-16T21:05:29Z

    Import ordering

commit cbea80bb1229312ed9741ba98204740355fc1434
Author: Josh Rosen <[email protected]>
Date:   2015-07-17T00:37:28Z

    Add UnsafeRowSerializer

commit 7876f31c5ae584b216f56d3fb7fc73379a011ea3
Author: Josh Rosen <[email protected]>
Date:   2015-07-17T00:55:32Z

    Merge remote-tracking branch 'origin/master' into unsafe-shuffle

commit 035af21be95fb2d22e020d2c30319940fd05ae26
Author: Josh Rosen <[email protected]>
Date:   2015-07-17T01:16:52Z

    Add logic for choosing when to use UnsafeRowSerializer

commit dd9c66d20abbea4c628c223f5a491ae661c90c93
Author: Josh Rosen <[email protected]>
Date:   2015-07-17T01:28:42Z

    Fix for copying logic

commit 8dd3ff21016bf9b5217626c5b9723f43de055def
Author: Josh Rosen <[email protected]>
Date:   2015-07-17T01:37:49Z

    Exchange outputs UnsafeRows when its child outputs them

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-9023] [SQL] [WIP] Efficiency improvemen...

Reply via email to