GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/16603
[SPARK-19244][Core] Sort MemoryConsumers according to their memory usage
when spilling
## What changes were proposed in this pull request?
In `TaskMemoryManager`, when `acquireExecutionMemory` cannot obtain the
required memory, we try to spill other memory consumers to free it.
Currently, we simply iterate over the memory consumers in a hash set. Normally
the consumers are visited in the same order every time.
The first issue is that we might spill more consumers than necessary. For
example, suppose consumer 1 uses 10MB and consumer 2 uses 50MB, and consumer 3
then requests 100MB but only 60MB is available, so spilling is needed. We might
spill both consumer 1 and consumer 2, but spilling only consumer 2 would be
enough to get the required 100MB.
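
The arithmetic in this example can be checked with a small standalone
simulation. This is only a sketch and not code from `TaskMemoryManager`: the
`Consumer` class, the method names, and the "spill in list order until the
shortfall is covered" loop are all assumptions made for illustration. The
shortfall is 100MB - 60MB = 40MB.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; not Spark's TaskMemoryManager code.
final class SpillOrderExample {

    // Hypothetical stand-in for a memory consumer: a name and its usage in MB.
    static final class Consumer {
        final String name;
        final long usedMb;

        Consumer(String name, long usedMb) {
            this.name = name;
            this.usedMb = usedMb;
        }
    }

    // Consumers from the example, in the order a fixed iteration might visit them.
    static List<Consumer> exampleConsumers() {
        List<Consumer> consumers = new ArrayList<>();
        consumers.add(new Consumer("consumer1", 10));
        consumers.add(new Consumer("consumer2", 50));
        return consumers;
    }

    // Returns a copy of the list ordered from largest to smallest usage.
    static List<Consumer> sortedByUsage(List<Consumer> consumers) {
        List<Consumer> sorted = new ArrayList<>(consumers);
        sorted.sort(Comparator.comparingLong((Consumer c) -> c.usedMb).reversed());
        return sorted;
    }

    // Spills consumers in list order until shortfallMb has been freed,
    // and returns the names of the consumers that were spilled.
    static List<String> spillUntilFreed(List<Consumer> order, long shortfallMb) {
        List<String> spilled = new ArrayList<>();
        long freedMb = 0;
        for (Consumer c : order) {
            if (freedMb >= shortfallMb) {
                break;
            }
            freedMb += c.usedMb;
            spilled.add(c.name);
        }
        return spilled;
    }

    public static void main(String[] args) {
        long shortfallMb = 100 - 60; // consumer 3 wants 100MB, 60MB is available
        List<Consumer> fixedOrder = exampleConsumers();
        System.out.println("fixed order:     " + spillUntilFreed(fixedOrder, shortfallMb));
        System.out.println("sorted by usage: "
            + spillUntilFreed(sortedByUsage(fixedOrder), shortfallMb));
    }
}
```

Under these assumptions the fixed order spills both consumers, while the
usage-sorted order spills only consumer 2.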
The second issue concerns repeated spilling. Suppose we spill consumer 1 the
first time spilling happens, and after a while consumer 1 uses 5MB again. Then
consumer 4 may acquire some memory and spilling is needed again. Because we
iterate over the memory consumers in the same order, we will spill consumer 1
again. So consumer 1 ends up producing many small spill files.
This patch changes how the memory consumers are iterated. It sorts the memory
consumers by their memory usage, so the consumer using more memory is spilled
first. Once a consumer has been spilled, even if it acquires a little memory
again, it will not be the one spilled the next time spilling happens, as long
as other consumers are using more memory than it.
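
A minimal sketch of the sorted strategy across two rounds of spilling may help
show why it avoids small spill files. It is not the patch itself: the class
`SortedSpillSketch`, its map-based bookkeeping, and the 8MB shortfall in the
second round are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not Spark's TaskMemoryManager code.
final class SortedSpillSketch {
    // Hypothetical bookkeeping: consumer name -> memory currently held, in MB.
    private final Map<String, Long> usedMb = new HashMap<>();
    // Sizes (in MB) of the spill files each consumer has written so far.
    private final Map<String, List<Long>> spillFilesMb = new HashMap<>();

    void setUsage(String consumer, long mb) {
        usedMb.put(consumer, mb);
    }

    List<Long> spillFiles(String consumer) {
        return spillFilesMb.getOrDefault(consumer, Collections.emptyList());
    }

    // Frees at least neededMb by spilling consumers from largest to smallest
    // usage, and returns the names of the consumers that were spilled.
    List<String> spill(long neededMb) {
        List<Map.Entry<String, Long>> candidates = new ArrayList<>(usedMb.entrySet());
        candidates.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        List<String> spilled = new ArrayList<>();
        long freedMb = 0;
        for (Map.Entry<String, Long> candidate : candidates) {
            if (freedMb >= neededMb) {
                break;
            }
            long held = candidate.getValue();
            if (held == 0) {
                continue; // nothing to spill
            }
            String name = candidate.getKey();
            spillFilesMb.computeIfAbsent(name, k -> new ArrayList<>()).add(held);
            usedMb.put(name, 0L); // spilling releases everything the consumer held
            freedMb += held;
            spilled.add(name);
        }
        return spilled;
    }

    public static void main(String[] args) {
        SortedSpillSketch manager = new SortedSpillSketch();
        manager.setUsage("consumer1", 10);
        manager.setUsage("consumer2", 50);
        System.out.println("round 1 spills: " + manager.spill(40));

        // The spilled consumer grows back to a small 5MB footprint.
        manager.setUsage("consumer2", 5);
        System.out.println("round 2 spills: " + manager.spill(8));
        System.out.println("consumer2 spill files (MB): " + manager.spillFiles("consumer2"));
    }
}
```

In this sketch consumer 2 is spilled in the first round, and in the second
round consumer 1 is chosen because it now holds more memory, so consumer 2
does not write a second, 5MB spill file.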
## How was this patch tested?
Jenkins tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 sort-memoryconsumer-when-spill
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16603.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16603
----
commit 4c2b7b02e809614993d25b21aee3e1d55355e482
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-01-16T08:57:57Z
Sort MemoryConsumers according to their memory usage when spilling.
----