GitHub user tejasapatil opened a pull request:
https://github.com/apache/spark/pull/18843
[SPARK-21595] Separate thresholds for buffering and spilling in
ExternalAppendOnlyUnsafeRowArray
## What changes were proposed in this pull request?
[SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported
excessive spilling to disk because the default spill threshold for
`ExternalAppendOnlyUnsafeRowArray` is quite small for the WINDOW operator. The
old behaviour of the WINDOW operator (before https://github.com/apache/spark/pull/16909)
was to hold data in an array for the first 4096 records, after which it would
switch to `UnsafeExternalSorter` and start spilling to disk once it reached
`spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if memory was
scarce because of many competing consumers).
Currently, `ExternalAppendOnlyUnsafeRowArray` uses a single threshold to control
both the switch from the in-memory array to `UnsafeExternalSorter` and the point
at which `UnsafeExternalSorter` starts spilling to disk. This PR separates the
two so that each can be tuned independently.
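
To illustrate why one shared threshold causes frequent spills, here is a minimal toy model in Python. It is a sketch under stated assumptions, not Spark's implementation: the class `TwoThresholdBuffer`, its fields, and the helper `count_spills` are hypothetical names, rows are plain integers, and a "spill" is only counted rather than written to disk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TwoThresholdBuffer:
    """Toy row buffer with separate buffering and spilling thresholds.

    Hypothetical model for illustration only; it does not reproduce
    ExternalAppendOnlyUnsafeRowArray or UnsafeExternalSorter.
    """
    in_memory_threshold: int  # rows kept in the plain array before switching
    spill_threshold: int      # rows the sorter stand-in holds before spilling
    array: List[int] = field(default_factory=list)
    sorter_rows: List[int] = field(default_factory=list)
    using_sorter: bool = False
    spill_count: int = 0

    def add(self, row: int) -> None:
        # Phase 1: cheap in-memory array until the buffering threshold is hit.
        if not self.using_sorter and len(self.array) < self.in_memory_threshold:
            self.array.append(row)
            return
        # One-time switch: hand everything buffered so far to the sorter.
        if not self.using_sorter:
            self.using_sorter = True
            self.sorter_rows.extend(self.array)
            self.array.clear()
        # Phase 2: the sorter stand-in spills whenever it reaches its threshold.
        self.sorter_rows.append(row)
        if len(self.sorter_rows) >= self.spill_threshold:
            self.spill_count += 1      # stand-in for writing a spill file
            self.sorter_rows.clear()   # stand-in for releasing memory


def count_spills(num_rows: int, in_memory_threshold: int, spill_threshold: int) -> int:
    """Feed num_rows rows through the toy buffer and return how often it spilled."""
    buf = TwoThresholdBuffer(in_memory_threshold, spill_threshold)
    for row in range(num_rows):
        buf.add(row)
    return buf.spill_count


if __name__ == "__main__":
    rows = 100_000
    # One shared value for both thresholds, as in the behaviour described above.
    print("shared threshold:", count_spills(rows, 4096, 4096))
    # Separate thresholds: same small array, much larger spill limit.
    print("separate thresholds:", count_spills(rows, 4096, 1_000_000))
```

In this model, setting both thresholds to the same small value makes the buffer spill repeatedly as soon as it leaves the array phase, whereas a larger spill threshold keeps the same cheap array phase and never spills for this input. The numbers are arbitrary and chosen only to make the contrast visible.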
## How was this patch tested?
Added unit tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tejasapatil/spark SPARK-21595
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18843.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18843
----
commit 8e3bfb7715e366a64da6add80253373af7d07915
Author: Tejas Patil <[email protected]>
Date: 2017-08-04T00:53:03Z
Separate thresholds for buffering and spilling
----