GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/8010

    [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions

    Previously, we would open a new file for each new dynamic written out using 
`FSBasedRelation`.  For formats like parquet this is very costly due to the 
buffers required to get good compression.  In this PR I refactor the code 
allowing us to fall back on an external sort when many partitions are seen.  As 
such each task will open no more than `spark.sql.sources.maxFiles` files.  I 
also did the following cleanup:
    
     - Instead of keying the file HashMap on an expensive to compute string 
representation of the partition, we now use a fairly cheap UnsafeProjection 
that avoids heap allocations.
     - The control flow for instantiating and invoking a writer container has 
been simplified.  Now instead of switching in two places based on the use of 
partitioning, the specific writer container must implement a single method 
`writeRows` that is invoked using `runJob`.
     - `InternalOutputWriter` has been removed.  Instead we have a 
`private[sql]` method `writeInternal` that converts and calls the public 
method.  This method can be overridden by internal datasources to avoid the 
conversion.  This change remove a lot of code duplication and per-row 
`asInstanceOf` checks.
     - `commands.scala` has been split up.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark fsWriting

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8010.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8010
    
----
commit 8ec75ac7a45982137f05ac92165e40a469dd7f54
Author: Michael Armbrust <[email protected]>
Date:   2015-08-06T22:06:40Z

    [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to