Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/15722
@davies - Yes, we dumped the logs and confirmed that the OOM happens because
we are not freeing the `LongArray` while resetting the `BytesToBytesMap`. The
job which used to fail with OOM runs fine with this change.

As explained above, this situation leads to OOM when an already-running
task has been allocated more than its fair share of memory because of a delay
in scheduling by the scheduler. The `LongArray` itself can grow beyond the
fair share of memory for the task (we have use cases where the `LongArray`
consumes a significant portion of total memory because of too many keys), and
later, when the task spills, the `LongArray` is not freed; as a result,
subsequent memory allocation requests are denied by the memory manager,
causing the OOM.
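
For reference, a rough sketch of the idea behind the change (not the actual
patch): free the `LongArray` when the map is reset so its pages go back to the
memory manager and can be granted to other consumers. The field and helper
names (`longArray`, `freeArray`, `freePage`, `allocate`) follow the usual
`MemoryConsumer`/`BytesToBytesMap` conventions but should be read as
assumptions here, not as the exact code in the PR:

```java
// Sketch only: release the LongArray during reset() instead of keeping it
// allocated, so that spilling actually returns its memory.
public void reset() {
  updatePeakMemoryUsed();
  numKeys = 0;
  numValues = 0;
  if (longArray != null) {
    // Return the array's backing memory to the task memory manager.
    freeArray(longArray);
    longArray = null;
  }
  // Free the data pages as before, then re-allocate a small initial array.
  while (dataPages.size() > 0) {
    MemoryBlock dataPage = dataPages.removeLast();
    freePage(dataPage);
  }
  allocate(initialCapacity);
}
```

With this, a task that spills after growing a very large `LongArray` no longer
keeps that memory pinned, so later allocation requests from the same or other
tasks are far less likely to be rejected.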