JoshRosen commented on a change in pull request #25953: [SPARK-29244][Core]
Prevent freed page in BytesToBytesMap free again
URL: https://github.com/apache/spark/pull/25953#discussion_r329323330
##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -787,8 +783,16 @@ private void allocate(int capacity) {
assert (capacity >= 0);
capacity = Math.max((int) Math.min(MAX_CAPACITY,
ByteArrayMethods.nextPowerOf2(capacity)), 64);
assert (capacity <= MAX_CAPACITY);
- longArray = allocateArray(capacity * 2L);
- longArray.zeroOut();
+ try {
+ longArray = allocateArray(capacity * 2L);
+ longArray.zeroOut();
+ } catch (SparkOutOfMemoryError e) {
+ // When OOM, allocated page was already freed by `TaskMemoryManager`.
+ // We should not keep it in `longArray`. Otherwise it might be freed again in task
+ // complete listeners and cause unnecessary error.
+ longArray = null;
Review comment:
If the `allocate()` call throws `SparkOutOfMemoryError`, then I don't
think the original `longArray = <new array>` here would change `longArray`;
instead, I think that `longArray` would continue to point to the old array. If
we simply `null` out `longArray` here then I think we may lose our reference to
the old array, but the Javadoc of this `allocate()` method says:
```java
/**
* Allocate new data structures for this map. When calling this outside of the constructor,
* make sure to keep references to the old data structures so that you can free them.
*
* @param capacity the new map capacity
*/
private void allocate(int capacity) {
```
and it looks like the callers are responsible for keeping an old reference
and managing cleanup. With this in mind, I'd like to dig into how the old code
worked to see if we can gain a clearer understanding of the root-cause of the
bug:
There are three calls to `allocate()` in this file:
- In the constructor
- In `growAndRehash()`
- In `reset()`
Let's look at each in turn and see whether this patch's changes modify those
call sites' behavior in the case where a `SparkOutOfMemoryError` exception is
thrown:
The constructor call is unaffected because `longArray` is initially `null`.
In `growAndRehash()`, a `SparkOutOfMemoryError` thrown by its `allocate()` call
will result in us setting `longArray = null` here and then doing a `longArray =
oldLongArray`
[assignment](https://github.com/apache/spark/pull/25953/files#diff-976d2d63175b5830e120d3f3b873bc76R919)
to restore the old value. Given this, I think this patch's changes are a no-op
w.r.t. the end state of `longArray` after an exception is thrown here. In this
patch, we're setting `canGrowArray = false` to prevent continuing through to
the "re-mask" step of `growAndRehash()`, but the old code never would have
reached there in case of OOM because the `SparkOutOfMemoryError` would have
been allowed to bubble up.
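To make the save-and-restore pattern concrete, here is a minimal stand-alone sketch (hypothetical simplified names, not the actual Spark source; in particular, `allocate()` here returns the new array rather than assigning the field as the real method does):

```java
public class GrowAndRehashSketch {
    static long[] longArray = new long[]{42L};
    static boolean canGrowArray = true;

    // Allocation stand-in that always fails, like a SparkOutOfMemoryError
    // thrown from TaskMemoryManager.
    static long[] allocate(int capacity) {
        throw new OutOfMemoryError("simulated allocation failure");
    }

    // Sketch of growAndRehash()'s error handling: the call site keeps its own
    // reference to the old array and restores it on failure, so whatever
    // allocate() does to `longArray` internally is a no-op for this caller.
    static void growAndRehash() {
        long[] oldLongArray = longArray;           // keep the old reference
        try {
            longArray = allocate(oldLongArray.length * 2);
        } catch (OutOfMemoryError e) {
            longArray = oldLongArray;              // restore the old array
            canGrowArray = false;                  // stop further growth
            return;
        }
        // on success: re-insert ("re-mask") entries into the new array here
    }

    public static void main(String[] args) {
        growAndRehash();
        // the old array survives the failed growth attempt
        System.out.println(longArray[0] == 42L && !canGrowArray);
    }
}
```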
Finally, the `reset()` method has a `freeArray(longArray)` call, followed by
a call to `allocate(initialCapacity)`. **I think this is the source of the
original bug**:
In the `free()` method we have
```java
public void free() {
updatePeakMemoryUsed();
if (longArray != null) {
freeArray(longArray);
longArray = null;
}
[...]
```
where the code that frees `longArray` also sets it to `null` immediately
afterwards. However, this `longArray = null` step appears to be missing in
`reset()`, where we have
```java
/**
* Reset this map to initialized state.
*/
public void reset() {
updatePeakMemoryUsed();
numKeys = 0;
numValues = 0;
freeArray(longArray);
while (dataPages.size() > 0) {
MemoryBlock dataPage = dataPages.removeLast();
freePage(dataPage);
}
allocate(initialCapacity);
canGrowArray = true;
currentPage = null;
pageCursor = 0;
}
```
Here, if `allocate(initialCapacity)` throws an exception, then
`longArray` will continue to point to an already-freed array. With this patch's
changes to `allocate()`'s behavior, `longArray` will be set to `null` and that
will prevent the dangling pointer to freed memory.
That seems like a somewhat indirect fix, though.
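To make the failure sequence concrete, here is a minimal stand-alone mock (hypothetical names, not Spark's actual classes) of the pre-patch `reset()` path, using a freed-array set to detect the double free:

```java
import java.util.HashSet;
import java.util.Set;

public class ResetDanglingRefDemo {
    static Set<long[]> freedArrays = new HashSet<>();
    static long[] longArray = new long[]{1L};

    // Mock of freeing an array; detects freeing the same array twice.
    static void freeArray(long[] array) {
        if (!freedArrays.add(array)) {
            throw new IllegalStateException("array freed twice");
        }
    }

    // Allocation stand-in that always fails, like SparkOutOfMemoryError.
    static long[] failingAllocate(int capacity) {
        throw new OutOfMemoryError("simulated");
    }

    // Old reset() behavior: no `longArray = null` after freeArray().
    static void reset() {
        freeArray(longArray);
        try {
            longArray = failingAllocate(64);
        } catch (OutOfMemoryError e) {
            // swallowed: longArray still points at the freed array
        }
    }

    // Mirrors the free() pattern: frees longArray if non-null.
    static boolean free() {
        try {
            if (longArray != null) {
                freeArray(longArray);
                longArray = null;
            }
            return true;
        } catch (IllegalStateException doubleFree) {
            return false;
        }
    }

    public static void main(String[] args) {
        reset();
        System.out.println(free() ? "ok" : "double free detected");
    }
}
```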
Given this (if I've understood this code correctly), what do you think about
simplifying this patch to instead update `reset()` to do
```java
freeArray(longArray);
longArray = null; // <--- added line
```
? I tried this one-line fix and the test case you added seems to pass.
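For illustration, here is a stand-alone mock (hypothetical names, not Spark code) of `reset()` with that one-line fix applied, showing that a later `free()` no longer touches the already-freed array:

```java
import java.util.HashSet;
import java.util.Set;

public class ResetFixedDemo {
    static Set<long[]> freedArrays = new HashSet<>();
    static long[] longArray = new long[]{1L};

    // Mock of freeing an array; detects freeing the same array twice.
    static void freeArray(long[] array) {
        if (!freedArrays.add(array)) {
            throw new IllegalStateException("array freed twice");
        }
    }

    // Allocation stand-in that always fails, like SparkOutOfMemoryError.
    static long[] failingAllocate(int capacity) {
        throw new OutOfMemoryError("simulated");
    }

    // reset() with the proposed fix: null out longArray immediately after
    // freeing it, so a failed allocation cannot leave a dangling reference.
    static void reset() {
        freeArray(longArray);
        longArray = null; // <--- the proposed one-line fix
        try {
            longArray = failingAllocate(64);
        } catch (OutOfMemoryError e) {
            // longArray is already null; nothing stale to free later
        }
    }

    // Mirrors the free() pattern: skips the array when it is null.
    static void free() {
        if (longArray != null) {
            freeArray(longArray);
            longArray = null;
        }
    }

    public static void main(String[] args) {
        reset();
        free(); // no double free: longArray was nulled in reset()
        System.out.println("no double free");
    }
}
```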
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]