[
https://issues.apache.org/jira/browse/TEZ-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142893#comment-15142893
]
Jason Lowe commented on TEZ-3115:
---------------------------------
When auto-parallelism kicks in we're going to see many copies of the same
upstream task attempt IDs, host:port strings, etc. We should at least consider
interning or otherwise sharing these, or potentially storing just the raw ID
and generating the string on the fly when necessary. MapHost is another
example of this redundancy, since it stores the fully qualified host name and
port at least three times (as part of baseUrl, identifier, and hostIdentifier).
I wonder if it would be better overall to store MapHost more compactly and
generate the URLs and identifiers on demand.
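
A rough sketch of what that could look like, for illustration only. The class
name CompactMapHost, the shared host pool, and the exact string formats
returned by the getters are assumptions made for this example; they are not
taken from the actual Tez MapHost code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for MapHost; not the actual Tez class.
final class CompactMapHost {
    // Shared pool so that equal host names resolve to a single String instance.
    private static final ConcurrentMap<String, String> HOSTS =
        new ConcurrentHashMap<>();

    private final String host; // canonical instance taken from the pool
    private final int port;
    private final int partitionId;

    CompactMapHost(String host, int port, int partitionId) {
        // putIfAbsent returns the previously pooled instance, or null if this
        // is the first time the host name has been seen.
        String pooled = HOSTS.putIfAbsent(host, host);
        this.host = (pooled != null) ? pooled : host;
        this.port = port;
        this.partitionId = partitionId;
    }

    String getHost() {
        return host;
    }

    // The strings below are built on demand instead of being kept as fields.
    // Their formats are assumed for the example.
    String getHostIdentifier() {
        return host + ":" + port;
    }

    String getIdentifier() {
        return host + ":" + port + ":" + partitionId;
    }

    String getBaseUrl() {
        return "http://" + host + ":" + port + "/mapOutput?";
    }
}
```

The trade-off is that each getter call allocates a short-lived string, so this
only pays off if the identifiers and URLs are needed far less often than the
objects are retained.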
> Shuffle string handling adds significant memory overhead
> --------------------------------------------------------
>
> Key: TEZ-3115
> URL: https://issues.apache.org/jira/browse/TEZ-3115
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
>
> While investigating the OOM heap dump from TEZ-3114 I noticed that the
> ShuffleManager and other shuffle-related objects were holding onto many
> strings that added up to over a hundred megabytes of memory.