Github user squito commented on the pull request:
https://github.com/apache/spark/pull/8427#issuecomment-134743609
@markhamstra yup, no question this will increase memory usage. The
question is whether we should consider it anyway. Maybe you were implicitly
answering "no", but I'm going to make my case again anyway :)
Clearly, if you have long-running jobs with lots of stages and you never do
anything to clean them up, then `stageIdToStage` is going to eat up all your
memory. But that will happen anyway: you'll already run out of memory because
of `MapOutputTracker` storing shuffle output (and most likely the huge number
of RDDs you've created that can't be gc'ed either). We'd add a few more hashmap
entries and more `Stage` objects, which shouldn't contain anything huge -- no
bigger than what we are already tracking. Certainly it'll have an effect,
though.
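To make the "a few more hashmap entries and `Stage` objects" claim a bit more concrete, here's a back-of-envelope sketch of the extra retained memory. The per-entry and per-`Stage` byte counts below are illustrative assumptions (typical ballpark figures for a 64-bit JVM), not measurements of Spark's actual objects:

```java
// Rough estimate of extra retained memory from keeping completed entries
// in stageIdToStage. All byte figures are assumptions for illustration.
public class StageMemoryEstimate {
    // Approximate overhead of one HashMap entry on a 64-bit JVM (assumed).
    static final long BYTES_PER_MAP_ENTRY = 48;
    // Rough guess at the retained size of a small Stage object (assumed).
    static final long BYTES_PER_STAGE = 500;

    static long extraBytes(long retainedStages) {
        return retainedStages * (BYTES_PER_MAP_ENTRY + BYTES_PER_STAGE);
    }

    public static void main(String[] args) {
        // e.g. a long-running app that has accumulated 100,000 stages:
        double mb = extraBytes(100_000) / (1024.0 * 1024.0);
        System.out.printf("~%.1f MB extra%n", mb);
    }
}
```

Under those assumptions, even 100,000 retained stages cost on the order of tens of megabytes -- real numbers would need measuring with a tool like jol or a heap dump.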
I think it's a pretty big usability improvement, so worth considering, but
that is totally subjective. I realize this is a bit hand-wavy for now -- I'll
try to quantify the memory usage effect so we can make a more informed decision
(if others are still somewhat interested).