GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/13206

    [SPARK-15420] [SQL] Add repartition and sort to prepare output data

    ## What changes were proposed in this pull request?
    
    * WriterContainer detects that the incoming logical plan has been sorted 
and does not sort a second time if the sort matches the table's partitioning, 
bucketing, and sorting.
    * Local sort is added by the optimizer for CatalogTables that have bucket 
or sort columns. This implements `sortBy` for Hive tables.
    * Repartition and sort operations are added by the optimizer for columnar 
formats when `spark.sql.files.columnar.insertRepartition` is enabled and the 
added operations do not conflict with the query.
    * Repartition and sort operations are added when 
`DataFrameWriter#writersPerPartition(Int)` is set. This lets users easily 
control how many files are created per table partition.
    
    ## How was this patch tested?
    
    WIP: adding tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-15420-parquet-repartition-sort

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13206
    
----
commit 65f253ea9a731135a952c04253e15cf5eff59151
Author: Ryan Blue <[email protected]>
Date:   2016-04-05T20:35:50Z

    SPARK-14543: Update InsertInto column resolution.
    
    This combines Hive's pre-insertion casts (without renames) that handle
    partitioning with the pre-insertion casts/renames in core. The combined
    rule, ResolveOutputColumns, will resolve columns by name or by position.
    Resolving by position fails when the number of columns is incorrect or
    when the input columns are a permutation of the output columns. When
    resolving by name, each output column is located
    by name in the child plan. This handles cases where a subset of a data
    frame is written out.
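The by-name and by-position checks described above can be sketched in plain Python. This is a simplified illustration of the rule's decision logic, not the actual ResolveOutputColumns code, which operates on Catalyst expressions rather than name lists:

```python
def resolve_output_columns(input_cols, table_cols, by_name):
    """Sketch: return input columns reordered to match the table schema,
    or raise when the write cannot be resolved safely."""
    if by_name:
        # Locate each output column by name in the child plan; this is
        # what allows writing out a subset of a data frame's columns.
        missing = [c for c in table_cols if c not in input_cols]
        if missing:
            raise ValueError("cannot resolve by name, missing: %s" % missing)
        return list(table_cols)
    # By position: the column count must match exactly.
    if len(input_cols) != len(table_cols):
        raise ValueError("wrong number of columns")
    # Input that is a permutation of the output names is almost always a
    # mistake, so fail rather than silently writing misaligned data.
    if input_cols != table_cols and sorted(input_cols) == sorted(table_cols):
        raise ValueError("input columns are a permutation of the output columns")
    return list(input_cols)
```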

commit d1339490c8421539a2f800b0b89562493414e794
Author: Ryan Blue <[email protected]>
Date:   2016-04-20T21:14:44Z

    SPARK-14543: Fix bad SQL in HiveQuerySuite test.

commit 2b1193504412e5942642f45dfc641039f190f310
Author: Ryan Blue <[email protected]>
Date:   2016-04-21T23:18:08Z

    SPARK-14543: Update InsertSuite test for too few columns.
    
    This PR now catches this problem during analysis and has a better error
    message. This commit updates the test for the new message and exception
    type.

commit 3a24e36ceb9b815e8c933723e22dbdbfed35c840
Author: Ryan Blue <[email protected]>
Date:   2016-05-09T18:01:19Z

    SPARK-14543: Update new InsertIntoTable parameter to Map.
    
    Adding new arguments to InsertIntoTable requires changes to several
    files. Instead of adding a long list of optional args, this adds an
    options map, like the one passed to DataSource. Future options can
    be added and used only where they are needed.

commit 3cdbfa83d4b064fbaf9d50b3bec51f4645dad0fb
Author: Ryan Blue <[email protected]>
Date:   2016-04-22T21:15:06Z

    SPARK-15420: Detect sorting and do not sort in WriterContainers.
    
    This avoids an extra sort in the WriterContainer when data has already
    been sorted as part of the query plan. This fixes writes for both
    HadoopFsRelation and MetastoreRelation.
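One way to read the "detect sorting" check is as a prefix comparison between the ordering the child plan already produces and the ordering the write requires. A simplified sketch of that test (the real check compares Catalyst SortOrder expressions, not column-name lists):

```python
def sort_satisfied(child_ordering, required_ordering):
    """The extra sort can be skipped when the required output ordering
    (partition, bucket, then sort columns) is a prefix of the ordering
    the child plan already produces."""
    n = len(required_ordering)
    return list(required_ordering) == list(child_ordering)[:n]
```

For example, a plan already sorted by (day, bucket, id) satisfies a write that requires ordering by (day, bucket), so no second sort is needed.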

commit eed85adee841675d96fe5fa918271016a24bc2ad
Author: Ryan Blue <[email protected]>
Date:   2016-04-22T23:36:24Z

    SPARK-15420: Add repartition, sort optimization.
    
    This adds an optimizer rule that will add repartition and sort
    operations to the logical plan. Sort is added when the table has sort
    or bucketing columns. Repartition is added when writing columnar formats
    and the option "spark.sql.files.columnar.insertRepartition" is enabled.
    
    This also adds a `writersPerPartition(numTasks: Int)` option when
    writing that controls the number of files in each output table
    partition. The optimizer rule adds a repartition step that distributes
    output by partition and a random value in [0, numTasks).
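The distribution scheme in the last paragraph can be sketched outside Spark. The helper below is hypothetical (the real rule adds a repartition operator keyed on the partition expressions plus a rand-based column), but it shows why each table partition ends up with at most numTasks files:

```python
import random
from collections import defaultdict

def assign_writer_buckets(rows, partition_key, num_tasks, rng=None):
    """Group rows by (partition value, random value in [0, num_tasks)).
    Each distinct key is handled by one writer, so every table partition
    is split across at most num_tasks output files."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    buckets = defaultdict(list)
    for row in rows:
        buckets[(partition_key(row), rng.randrange(num_tasks))].append(row)
    return buckets

rows = [{"day": d, "id": i} for d in ("a", "b") for i in range(100)]
buckets = assign_writer_buckets(rows, lambda r: r["day"], num_tasks=4)
```

Because the second key component is random rather than data-dependent, rows also spread roughly evenly across the writers for each partition.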

----

