GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/6444

    [WIP] Binary processing external sort for SQL's sort-merge join

    This is a WIP commit towards implementing a binary-processing, cache-aware 
external record sort for use in Spark SQL's sort-merge join.  The code here is 
modeled after an early design of #5868, which supported pluggable functions for 
comparing key prefixes and for comparing serialized records.
    
    I'll update this PR with a detailed design description later, similar to 
the detailed descriptions and comments posted at #5868; I'm only opening this 
now so that I can run some things through Jenkins and track some code review 
comments.
    
    This patch incorporates the changes in #6222.  After that patch is merged, 
I'll rebase to exclude those commits.
    
    This will address the following JIRAs (not putting the links in the title 
yet because I don't want to send an email blast): `[SPARK-7078] [SPARK-7079] 
[SPARK-7082]`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark sql-external-sort

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6444.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6444
    
----
commit a67678c81a76aba72a2aa75a3488506531b310ff
Author: Josh Rosen <[email protected]>
Date:   2015-05-17T18:45:12Z

    WIP refactoring of CatalystTypeConverters

commit 640ff1c7178041f5699b7af236766e488191287b
Author: Josh Rosen <[email protected]>
Date:   2015-05-17T23:14:10Z

    Comments and cleanup

commit 6477fbd588655f2b28624f1223b7a2a3bde833f5
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T00:57:26Z

    Throw ClassCastException errors during inbound conversions.

commit 7f46d9a6930d5917929a26d1d8dee7c3a025332c
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T01:36:04Z

    Remove last use of convertToScala().

commit fec87a0adf096b2582039a5530e3f2cc6f9c090f
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T01:40:45Z

    Fix wrong input data in InMemoryColumnarQuerySuite
    
    The schema declares an array of booleans, but we
    passed an array of integers instead.

commit fd81c599e2b42e14547e21ec7fafc33b2ded2e3c
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T01:58:53Z

    Fix serialization error in UserDefinedGenerator.

commit 9543a87e066e5216cbc6e3848ec9f16c45d868b7
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T01:59:07Z

    Fix null handling bug; add tests.

commit 51acd8f220c8fe05141c48eea7e235c6f7efdd52
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T23:23:13Z

    Fix JavaHashingTFSuite ClassCastException

commit 81a9ecd0c47244afe9a966306e68cc87b8ef3dd2
Author: Josh Rosen <[email protected]>
Date:   2015-05-26T23:03:52Z

    Initialize converters lazily so that the attributes are resolved first

commit ee25e8d269a7ac5d36b362e36d7e6395bd9e71f4
Author: Josh Rosen <[email protected]>
Date:   2015-05-27T19:04:01Z

    Re-add convertToScala(), since a Hive test still needs it

commit 1df1c2cccbfc68c6aaebdb0f63efd0555b3cbed6
Author: Josh Rosen <[email protected]>
Date:   2015-05-15T05:55:29Z

    WIP towards external sorter for Spark SQL.
    
    This is based on an early version of my shuffle sort patch; the
    implementation will undergo significant refactoring based on
    improvements made as part of the shuffle patch. Stay tuned.

commit 356a28eaa0c1a53d187faf96e87a633283ea58f3
Author: Josh Rosen <[email protected]>
Date:   2015-05-15T21:11:32Z

    Import my original tests and get them to pass.

commit 2c6a3899602e7fb83491f59c680a106aa853b477
Author: Josh Rosen <[email protected]>
Date:   2015-05-26T03:47:07Z

    Merge in a sketch of a unit test for the new sorter (now failing).

----


