GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/13206
[SPARK-15420] [SQL] Add repartition and sort to prepare output data
## What changes were proposed in this pull request?
* WriterContainer detects that the incoming logical plan has been sorted
and does not sort a second time if the sort matches the table's partitioning,
bucketing, and sorting.
* Local sort is added by the optimizer for CatalogTables that have bucket
or sort columns. This implements `sortBy` for Hive tables.
* Repartition and sort operations are added by the optimizer for columnar
formats when `spark.sql.files.columnar.insertRepartition` is enabled (and the
added operations do not conflict with the query).
* Repartition and sort operations are added when
`DataFrameWriter#writersPerPartition(Int)` is set. This enables users to easily
control how many files per partition are created.
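The local sort described above could be sketched as follows. This is a hypothetical Python simulation of the idea, not Spark's implementation: rows are modeled as dicts, and `local_sort_key` stands in for the sort ordering (bucket id first, then the table's sort columns) that a rule like this would add for bucketed Hive tables.

```python
def local_sort_key(row, num_buckets, bucket_col, sort_cols):
    """Hypothetical sketch: within each write task, order rows by
    (bucket id, sort columns), mirroring Hive's sortBy semantics for
    bucketed tables. Row representation and names are illustrative."""
    bucket_id = hash(row[bucket_col]) % num_buckets
    return (bucket_id,) + tuple(row[c] for c in sort_cols)

# Sorting a task's rows by this key groups each bucket's rows together
# and orders them by the sort columns within the bucket:
rows = [{"id": 3, "v": "b"}, {"id": 2, "v": "x"}, {"id": 3, "v": "a"}]
rows.sort(key=lambda r: local_sort_key(r, 2, "id", ["v"]))
```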
## How was this patch tested?
WIP: adding tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rdblue/spark SPARK-15420-parquet-repartition-sort
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13206.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13206
----
commit 65f253ea9a731135a952c04253e15cf5eff59151
Author: Ryan Blue <[email protected]>
Date: 2016-04-05T20:35:50Z
SPARK-14543: Update InsertInto column resolution.
This combines Hive's pre-insertion casts (without renames) that handle
partitioning with the pre-insertion casts/renames in core. The combined
rule, ResolveOutputColumns, will resolve columns by name or by position.
Resolving by position will detect cases where the number of columns is
incorrect or where the input columns are a permutation of the output
columns and fail. When resolving by name, each output column is located
by name in the child plan. This handles cases where a subset of a data
frame is written out.
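The by-name versus by-position behavior described above can be sketched in Python. This is an illustrative simulation only (function name and error messages are hypothetical, not the actual ResolveOutputColumns rule):

```python
def resolve_output_columns(expected, actual, by_name):
    """Hypothetical sketch of by-name vs. by-position output resolution.

    expected: the table's output column names, in table order.
    actual:   the column names produced by the child plan.
    """
    if by_name:
        # Locate each output column by name in the child plan; this allows
        # writing a subset-shaped data frame whose columns are reordered.
        missing = [c for c in expected if c not in actual]
        if missing:
            raise ValueError(f"cannot resolve output columns by name: {missing}")
        return expected
    # By position: the column count must match exactly.
    if len(actual) != len(expected):
        raise ValueError(f"expected {len(expected)} columns, got {len(actual)}")
    # A permutation of the output names is likely a mistake, so fail rather
    # than silently cast column-by-column.
    if sorted(actual) == sorted(expected) and actual != expected:
        raise ValueError("input columns are a permutation of the output columns")
    return expected
```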
commit d1339490c8421539a2f800b0b89562493414e794
Author: Ryan Blue <[email protected]>
Date: 2016-04-20T21:14:44Z
SPARK-14543: Fix bad SQL in HiveQuerySuite test.
commit 2b1193504412e5942642f45dfc641039f190f310
Author: Ryan Blue <[email protected]>
Date: 2016-04-21T23:18:08Z
SPARK-14543: Update InsertSuite test for too few columns.
This PR now catches this problem during analysis and has a better error
message. This commit updates the test for the new message and exception
type.
commit 3a24e36ceb9b815e8c933723e22dbdbfed35c840
Author: Ryan Blue <[email protected]>
Date: 2016-05-09T18:01:19Z
SPARK-14543: Update new InsertIntoTable parameter to Map.
Adding new arguments to InsertIntoTable requires changes to several
files. Instead of adding a long list of optional args, this adds an
options map, like the one passed to DataSource. Future options can
be added and used only where they are needed.
commit 3cdbfa83d4b064fbaf9d50b3bec51f4645dad0fb
Author: Ryan Blue <[email protected]>
Date: 2016-04-22T21:15:06Z
SPARK-15420: Detect sorting and do not sort in WriterContainers.
This avoids an extra sort in the WriterContainer when data has already
been sorted as part of the query plan. This fixes writes for both
HadoopFsRelation and MetastoreRelation.
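The detection itself reduces to a prefix check on sort orderings. A minimal Python sketch of that check (the function name and list-of-column-names representation are hypothetical):

```python
def already_sorted(plan_ordering, required_ordering):
    """Hypothetical sketch of the writer-side check: the writer can skip
    its own sort when the plan's output ordering starts with the ordering
    required by the table's partitioning, bucketing, and sort columns.

    A plan sorted by (a, b, c) is necessarily sorted by (a, b), so the
    required ordering only needs to be a prefix of the plan's ordering.
    """
    if len(required_ordering) > len(plan_ordering):
        return False
    return plan_ordering[:len(required_ordering)] == required_ordering
```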
commit eed85adee841675d96fe5fa918271016a24bc2ad
Author: Ryan Blue <[email protected]>
Date: 2016-04-22T23:36:24Z
SPARK-15420: Add repartition, sort optimization.
This adds an optimizer rule that will add repartition and sort
operations to the logical plan. Sort is added when the table has sort
or bucketing columns. Repartition is added when writing columnar formats
and the option "spark.sql.files.columnar.insertRepartition" is enabled.
This also adds a `writersPerPartition(numTasks: Int)` option when
writing that controls the number of files in each output table
partition. The optimizer rule adds a repartition step that distributes
output by partition and a random value in [0, numTasks).
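That distribution key could be sketched as follows. This is a hypothetical Python illustration of the idea (row-as-dict representation and names are not Spark's API): because each row's key is its partition values plus a random value in [0, numTasks), each table partition is spread over at most numTasks reducers, and therefore at most numTasks output files.

```python
import random

def distribution_key(row, partition_cols, num_tasks, rng=random):
    """Hypothetical sketch of the repartition key: partition values plus
    a random value in [0, num_tasks), bounding files per partition."""
    partition_values = tuple(row[c] for c in partition_cols)
    return partition_values + (rng.randrange(num_tasks),)
```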
----