Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/3422#discussion_r30993531
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -205,6 +205,13 @@ private[spark] class ExternalSorter[K, V, C](
map.changeValue((getPartition(kv._1), kv._1), update)
maybeSpillCollection(usingMap = true)
}
+ } else if (bypassMergeSort) {
+ // SPARK-4479: Also bypass buffering if merge sort is bypassed to avoid defensive copies
--- End diff --
Skipping this buffering seems to make much of the remaining
`bypassMergeSort`-handling code unnecessary. For example, if we don't
buffer then we won't need to spill, so we can remove the code that deals with
merging spills in the `bypassMergeSort` case. Based on this, I've opened #6397
to remove all of this now-unused code and to move the handling of the
`bypassMergeSort` path into its own file. It would be great if this PR's
reviewers could look at that PR to double-check my reasoning.
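
To illustrate the reasoning, here is a minimal sketch of an unbuffered bypass path. It is not the code from either PR: `PartitionWriter`, `BypassSketch`, and `writeAll` are hypothetical names, the "writer" is an in-memory stand-in for a per-partition disk writer, and the modulo partitioner is an assumption. The point is only that when each record goes straight to its partition's writer, nothing accumulates in memory, so there is nothing to spill and no spills to merge.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a per-partition disk writer (not Spark's API).
// It collects records in memory only so the sketch can be inspected.
final class PartitionWriter {
  val records: ArrayBuffer[(Int, String)] = ArrayBuffer.empty
  def write(key: Int, value: String): Unit = records += ((key, value))
}

object BypassSketch {
  // Route every record directly to the writer for its partition.
  // No intermediate map or buffer is kept, so no spill handling is needed.
  def writeAll(
      records: Iterator[(Int, String)],
      numPartitions: Int): Array[PartitionWriter] = {
    val writers = Array.fill(numPartitions)(new PartitionWriter)
    records.foreach { case (key, value) =>
      // Assumed partitioner: non-negative modulo of the key.
      val partition = java.lang.Math.floorMod(key, numPartitions)
      writers(partition).write(key, value)
    }
    writers
  }
}
```

In this shape the only per-task state is the array of open writers, which is why the spill-merging code for the `bypassMergeSort` case looks removable.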