GitHub user Parth-Brahmbhatt commented on the pull request:
https://github.com/apache/spark/pull/11800#issuecomment-212540659
Even before this change we were getting OOM errors. The issue seems to be
primarily the creation of many short-lived (young generation) objects. In
addition to this fix we also moved to the G1 collector, and we are using
-XX:NewRatio=1 to allocate half the heap to the young generation.
We have had this fix deployed in production for a week and have observed
one OOM crash. The heap dump is 12GB and I am still analyzing it, but initial
analysis again points at a lot of String and char[] instances being created. If
you are interested, I can share the heap dump.
Overall, one of the big issues is that during startup the history server tries
to load all the available logs (with the default 7 day retention), which in a
large multi-tenant cluster like ours is a lot of files. Most users won't really
click through to their applications, but deleting the event logs too early is
also not a good option. Ideally, I would propose that the history server create
simple summary files (containing only what is needed to show the application
summary in the UI), so that the next time the history server starts it does not
need to process the entire event log, only the summary file. Only when a user
clicks on an application would we need to process the entire event log.
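
To make the summary-file idea concrete, here is a rough sketch in Python. It is
not Spark code and not a proposed patch. It assumes event logs are JSON lines
with the usual SparkListenerApplicationStart / SparkListenerApplicationEnd
events; the ".summary.json" suffix, the function names and the summary fields
are all hypothetical choices made for illustration.

```python
import json
import os

# Hypothetical naming convention: the summary sits next to the event log.
SUMMARY_SUFFIX = ".summary.json"


def summarize_event_log(log_path):
    """Scan the full event log once, keeping only what a listing page needs."""
    summary = {
        "app_id": None,
        "app_name": None,
        "user": None,
        "start_time": None,
        "end_time": None,
        "completed": False,
    }
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            kind = event.get("Event")
            if kind == "SparkListenerApplicationStart":
                summary["app_id"] = event.get("App ID")
                summary["app_name"] = event.get("App Name")
                summary["user"] = event.get("User")
                summary["start_time"] = event.get("Timestamp")
            elif kind == "SparkListenerApplicationEnd":
                summary["end_time"] = event.get("Timestamp")
                summary["completed"] = True
            # Every other event (jobs, stages, tasks) is skipped here; it is
            # only needed when a user opens the application.
    return summary


def load_summary(log_path):
    """Return the summary, replaying the event log only if no fresh summary exists."""
    summary_path = log_path + SUMMARY_SUFFIX
    if (os.path.exists(summary_path)
            and os.path.getmtime(summary_path) >= os.path.getmtime(log_path)):
        with open(summary_path, encoding="utf-8") as f:
            return json.load(f)

    summary = summarize_event_log(log_path)
    # Write to a temp file and rename, so a crash never leaves a partial summary.
    tmp_path = summary_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(summary, f)
    os.replace(tmp_path, summary_path)
    return summary
```

On startup the server would call load_summary() for each log, which is cheap
once the summary exists, and would replay the full log only when a user opens
that application. The modification-time check is one simple way to refresh the
summary for logs that are still being appended to.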