[ 
https://issues.apache.org/jira/browse/SPARK-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7041:
------------------------------
    Description: In BypassMergeSortShuffleWriter, we may end up opening disk 
writer files for empty partitions.  This occurs because we manually call 
{{open()}} after creating the writer, which causes the serialization and 
compression output streams to be created.  These streams may write headers to 
the underlying file, resulting in non-zero-length files being created for 
partitions that contain no records.  This is unnecessary, though, since the 
disk object writer will automatically open itself when the first write is 
performed.  Removing this eager {{open()}} call and rewriting the consumers to 
cope with the non-existence of empty files results in a large performance 
benefit for certain sparse workloads when using sort-based shuffle.  (was: In 
ExternalSorter, we may end up opening disk writer files for empty partitions.  
This occurs because we manually call {{open()}} after creating the writer, 
which causes the serialization and compression output streams to be created.  
These streams may write headers to the underlying file, resulting in 
non-zero-length files being created for partitions that contain no records.  
This is unnecessary, though, since the disk object writer will automatically 
open itself when the first write is performed.  Removing this eager {{open()}} 
call and rewriting the consumers to cope with the non-existence of empty files 
results in a large performance benefit for certain sparse workloads when using 
sort-based shuffle.)

> Avoid writing empty files in BypassMergeSortShuffleWriter
> ---------------------------------------------------------
>
>                 Key: SPARK-7041
>                 URL: https://issues.apache.org/jira/browse/SPARK-7041
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>
> In BypassMergeSortShuffleWriter, we may end up opening disk writer files for 
> empty partitions.  This occurs because we manually call {{open()}} after 
> creating the writer, which causes the serialization and compression output 
> streams to be created.  These streams may write headers to the underlying 
> file, resulting in non-zero-length files being created for partitions that 
> contain no records.  This is unnecessary, though, since the disk object 
> writer will automatically open itself when the first write is performed.  
> Removing this eager {{open()}} call and rewriting the consumers to cope with 
> the non-existence of empty files results in a large performance benefit for 
> certain sparse workloads when using sort-based shuffle.
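
For illustration only, here is a minimal Java sketch of the lazy-open pattern 
the description refers to.  It is not Spark's code: LazyPartitionWriter and 
partitionLength are hypothetical names, and java.util.zip.GZIPOutputStream 
stands in for Spark's serialization and compression streams because its 
constructor also writes a header as soon as it is created.  The writer defers 
creating the stream until the first record arrives, and the consumer-side 
helper treats a missing file as a zero-length partition.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for a per-partition disk writer; not Spark's API.
final class LazyPartitionWriter implements Closeable {
    private final Path file;
    private OutputStream out; // stays null until the first write

    LazyPartitionWriter(Path file) {
        this.file = file;
    }

    void write(byte[] record) throws IOException {
        if (out == null) {
            // Opening here, not in the constructor, matters: the
            // GZIPOutputStream constructor writes a header immediately,
            // which would leave a non-empty file for an empty partition.
            out = new GZIPOutputStream(Files.newOutputStream(file));
        }
        out.write(record);
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }

    // Consumer side: a file that was never created means an empty partition.
    static long partitionLength(Path file) throws IOException {
        return Files.exists(file) ? Files.size(file) : 0L;
    }
}
```

With this shape, closing a writer that never received a record leaves no file 
on disk, so any code that later reads partition lengths has to tolerate the 
file being absent, which is the consumer-side change the description mentions.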


