GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/5622
[WIP] [SPARK-7041] Avoid writing empty files in ExternalSorter
In ExternalSorter, we may end up opening disk writer files for empty
partitions. This occurs because we manually call `open()` after creating the
writer, which causes the serialization and compression output streams to be
created. These streams may write headers to the underlying file, resulting in
non-zero-length files being created for partitions that contain no records.
This is unnecessary, though, since the disk object writer will automatically
open itself when the first write is performed. Removing this eager `open()` call
and rewriting the consumers to cope with the non-existence of empty files
results in a large performance benefit for certain sparse workloads when using
sort-based shuffle.
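The effect described above can be reproduced outside Spark. The following is a minimal Java sketch, not the code from this patch: `LazyFileWriter`, `eagerOpenHeaderBytes`, `emptyPartitionLeavesNoFile`, and `segmentLength` are hypothetical names made up for illustration, and `java.io.ObjectOutputStream` merely stands in for whichever serialization or compression stream is configured. It shows that just constructing such a stream writes header bytes, and that deferring construction until the first write leaves no file behind for an empty partition.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;

public class LazyWriterSketch {

    /** Hypothetical writer that opens its file only when the first record is written. */
    static final class LazyFileWriter implements AutoCloseable {
        private final File file;
        private ObjectOutputStream out; // stays null until the first write

        LazyFileWriter(File file) {
            this.file = file;
        }

        void write(Object record) throws IOException {
            if (out == null) {
                // Creating the streams here (not in the constructor) means the
                // file and the stream header only appear once there is data.
                out = new ObjectOutputStream(new FileOutputStream(file));
            }
            out.writeObject(record);
        }

        @Override
        public void close() throws IOException {
            if (out != null) {
                out.close();
            }
        }
    }

    /** Number of bytes written just by opening an ObjectOutputStream, with no records. */
    public static int eagerOpenHeaderBytes() {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(sink);
            oos.flush();
            return sink.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if a lazy writer that is closed without any writes leaves no file on disk. */
    public static boolean emptyPartitionLeavesNoFile() {
        try {
            File dir = Files.createTempDirectory("lazy-writer").toFile();
            File partitionFile = new File(dir, "partition-0");
            try (LazyFileWriter writer = new LazyFileWriter(partitionFile)) {
                // Intentionally write nothing: this models an empty partition.
            }
            boolean absent = !partitionFile.exists();
            dir.delete();
            return absent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Consumer-side helper: a missing file is treated as a zero-length segment. */
    public static long segmentLength(File file) {
        return file.exists() ? file.length() : 0L;
    }

    public static void main(String[] args) {
        System.out.println("header bytes from eager open: " + eagerOpenHeaderBytes());
        System.out.println("empty partition leaves no file: " + emptyPartitionLeavesNoFile());
    }
}
```

Treating a missing file as a zero-length segment, as `segmentLength` does, is one way a consumer could cope with files that were never created; how the actual patch handles this may differ.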
This patch is marked as [WIP] because it incorporates code from another one
of my PRs (#5606). Submitting now so Jenkins tests it.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark file-handle-optimizations
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5622.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5622
----
commit aeb680e986474e9e16bb61b8a3e165d1be1c4c2e
Author: Josh Rosen <[email protected]>
Date: 2015-04-21T07:22:55Z

    [SPARK-3386] Reuse SerializerInstance in shuffle code paths

commit 64f83982439db4f2ac4e6814dcc6a6ecdea82074
Author: Josh Rosen <[email protected]>
Date: 2015-04-21T08:07:34Z

    Use ThreadLocal for serializer instance in CoarseGrainedExecutorBackend

commit f661ce7f26cf1a6131bebaaa051ed25542b774a2
Author: Josh Rosen <[email protected]>
Date: 2015-04-21T17:14:39Z

    Remove thread local; add comment instead

commit a21a5836b1b1b536419eb39368e1a91f37e58eb6
Author: Josh Rosen <[email protected]>
Date: 2015-04-21T18:55:11Z

    Avoid IO operations on empty files in BlockObjectWriter.

commit f81918bf23ee1e0ce72878c5f818288c039afd24
Author: Josh Rosen <[email protected]>
Date: 2015-04-21T19:10:46Z

    Do not create empty files at all.

commit b650ab2703fdde17f8d463dbc1257e70f3999fc8
Author: Josh Rosen <[email protected]>
Date: 2015-04-21T20:30:00Z

    Reduce scope of FileOutputStream in ExternalSorter

----