GitHub user mridulm commented on the pull request:
https://github.com/apache/spark/pull/8427#issuecomment-174271331
Just a note about MapOutputTracker: it is fairly trivial to make it use the
bare minimum amount of memory, even if it does not get cleaned up for 'old'
stages, by using a disk-backed map (mapdb, for example) with LRU eviction.
This keeps at most the current and previous map outputs in memory and
everything else on disk (until a node failure requires recomputation, which
brings portions of this back into memory).
This is what we used to do for production jobs in some earlier projects.
I am not sure what the impact of the current proposal is from a memory
overhead point of view. Map output was (obviously) expensive enough to
attempt this, and the effect was not pervasive/diffuse across the codebase
for shuffle output tracking.
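
To make the idea concrete, here is a rough sketch of a disk-backed map with LRU eviction. It is only an illustration: it is written in Python rather than against Spark or mapdb, it uses pickle files in a temp directory as a stand-in for a real disk-backed store, and every name in it (`DiskBackedLRUMap`, `max_in_memory`, `spill_dir`, `in_memory_ids`) is made up for this example.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class DiskBackedLRUMap:
    """Hypothetical sketch: keep the most recently used entries in memory
    and spill everything else to disk, reloading on demand."""

    def __init__(self, max_in_memory=2, spill_dir=None):
        # max_in_memory=2 mirrors "current and previous map output".
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="map-output-")
        self._hot = OrderedDict()  # shuffle id -> statuses, least recent first
        self._on_disk = set()      # shuffle ids currently spilled to disk

    def _path(self, shuffle_id):
        return os.path.join(self.spill_dir, f"shuffle-{shuffle_id}.pkl")

    def _evict_if_needed(self):
        # Spill least recently used entries until we are within the limit.
        while len(self._hot) > self.max_in_memory:
            old_id, statuses = self._hot.popitem(last=False)
            with open(self._path(old_id), "wb") as f:
                pickle.dump(statuses, f)
            self._on_disk.add(old_id)

    def put(self, shuffle_id, statuses):
        if shuffle_id in self._on_disk:
            # Drop the stale spilled copy; the new value supersedes it.
            os.remove(self._path(shuffle_id))
            self._on_disk.discard(shuffle_id)
        self._hot[shuffle_id] = statuses
        self._hot.move_to_end(shuffle_id)
        self._evict_if_needed()

    def get(self, shuffle_id):
        if shuffle_id in self._hot:
            self._hot.move_to_end(shuffle_id)  # mark as recently used
            return self._hot[shuffle_id]
        if shuffle_id in self._on_disk:
            # E.g. a node failure forces recomputation of an old stage:
            # bring that entry back into memory, possibly evicting another.
            path = self._path(shuffle_id)
            with open(path, "rb") as f:
                statuses = pickle.load(f)
            os.remove(path)
            self._on_disk.discard(shuffle_id)
            self._hot[shuffle_id] = statuses
            self._evict_if_needed()
            return statuses
        raise KeyError(shuffle_id)

    def in_memory_ids(self):
        # Least recently used first.
        return list(self._hot)


tracker = DiskBackedLRUMap(max_in_memory=2)
tracker.put(0, ["status-a"])
tracker.put(1, ["status-b"])
tracker.put(2, ["status-c"])   # shuffle 0 is spilled to disk here
old = tracker.get(0)           # reloaded from disk; shuffle 1 is spilled
```

The point of the sketch is that all the spill/reload logic stays behind `put` and `get`, so callers do not need to know whether an entry is in memory or on disk.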