GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/5622

    [WIP] [SPARK-7041] Avoid writing empty files in ExternalSorter

    In ExternalSorter, we may end up opening disk writer files for empty 
partitions. This occurs because we manually call `open()` after creating the 
writer, which causes the serialization and compression output streams to be 
created. These streams may write headers to the underlying file, resulting in 
non-zero-length files being created for partitions that contain no records. 
This is unnecessary, since the disk object writer automatically opens itself 
when the first write is performed. Removing this eager `open()` call and 
rewriting the consumers to cope with the non-existence of empty files results 
in a large performance benefit for certain sparse workloads when using 
sort-based shuffle.
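
    The effect described above can be illustrated with a minimal, standalone 
sketch. This is not Spark's code: `LazyOpenDemo`, `LazyWriter`, 
`writePartition`, and `partitionLength` are hypothetical names, and 
`java.util.zip.GZIPOutputStream` stands in for Spark's serialization and 
compression streams only because it also writes a header when it is 
constructed.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; these are not Spark classes.
class LazyOpenDemo {

    /** A writer that builds its stream stack on open(), or on first write. */
    static final class LazyWriter implements AutoCloseable {
        private final File file;
        private OutputStream out; // stays null until the writer is opened

        LazyWriter(File file) {
            this.file = file;
        }

        /** Creates the file; GZIPOutputStream writes its header right away. */
        void open() throws IOException {
            if (out == null) {
                out = new GZIPOutputStream(new FileOutputStream(file));
            }
        }

        /** Opens the writer on demand, so no eager open() call is needed. */
        void write(String record) throws IOException {
            open();
            out.write(record.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            if (out != null) {
                out.close();
                out = null;
            }
        }
    }

    /** Writes the records and returns the file length, or -1 if no file exists. */
    static long writePartition(File file, String[] records, boolean eagerOpen)
            throws IOException {
        try (LazyWriter writer = new LazyWriter(file)) {
            if (eagerOpen) {
                writer.open(); // the eager call that the patch removes
            }
            for (String record : records) {
                writer.write(record);
            }
        }
        return file.exists() ? file.length() : -1L;
    }

    /** Consumer-side view: a missing file is treated as an empty partition. */
    static long partitionLength(File file) {
        return file.exists() ? file.length() : 0L;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("lazy-open-demo").toFile();
        String[] none = new String[0];
        System.out.println("eager, empty partition: "
                + writePartition(new File(dir, "eager"), none, true) + " bytes");
        System.out.println("lazy, empty partition (-1 means no file): "
                + writePartition(new File(dir, "lazy"), none, false));
        System.out.println("length seen by a consumer: "
                + partitionLength(new File(dir, "lazy")));
    }
}
```

    In this sketch the eager path leaves a non-empty file behind for a 
partition with no records, while the lazy path never creates the file, so the 
consumer has to treat a missing file as an empty partition.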
    
    This patch is marked as [WIP] because it incorporates code from another one 
of my PRs (#5606).  Submitting now so Jenkins tests it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark file-handle-optimizations

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5622.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5622
    
----
commit aeb680e986474e9e16bb61b8a3e165d1be1c4c2e
Author: Josh Rosen <[email protected]>
Date:   2015-04-21T07:22:55Z

    [SPARK-3386] Reuse SerializerInstance in shuffle code paths

commit 64f83982439db4f2ac4e6814dcc6a6ecdea82074
Author: Josh Rosen <[email protected]>
Date:   2015-04-21T08:07:34Z

    Use ThreadLocal for serializer instance in CoarseGrainedExecutorBackend

commit f661ce7f26cf1a6131bebaaa051ed25542b774a2
Author: Josh Rosen <[email protected]>
Date:   2015-04-21T17:14:39Z

    Remove thread local; add comment instead

commit a21a5836b1b1b536419eb39368e1a91f37e58eb6
Author: Josh Rosen <[email protected]>
Date:   2015-04-21T18:55:11Z

    Avoid IO operations on empty files in BlockObjectWriter.

commit f81918bf23ee1e0ce72878c5f818288c039afd24
Author: Josh Rosen <[email protected]>
Date:   2015-04-21T19:10:46Z

    Do not create empty files at all.

commit b650ab2703fdde17f8d463dbc1257e70f3999fc8
Author: Josh Rosen <[email protected]>
Date:   2015-04-21T20:30:00Z

    Reduce scope of FileOutputStream in ExternalSorter

----


