[
https://issues.apache.org/jira/browse/HDFS-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551229#comment-16551229
]
Misha Dmitriev commented on HDFS-13752:
---------------------------------------
Hi [~b.maidics],
Can you share more details of your memory analysis? What version of Hive did
you use, what kind of workload, how big is your heap, what tool for heap
analysis? Can you share the original heap dump?
The reason I am asking is that I've analyzed a large number of HS2 and HMS heap
dumps over the last couple of years, and made a number of memory-related
improvements. In particular, I noticed that in some situations many of the strings
referenced by {{URI}}s are duplicates, so we now have the
{{StringInternUtils.internUriStringsInPath}} method. I can imagine that even
with string interning, {{URI}} objects may still _retain_ a lot of memory
because of long/numerous strings. But did you check that the {{URI}} objects
themselves really take 42% of memory? In all the realistic dumps that I've
checked, these objects take less than 1% of memory. So 42% seems more like "the
total memory used by URIs and all the strings that they reference", but even
then the number looks really big.
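
To illustrate what string interning buys here, below is a minimal sketch. It is not the actual {{StringInternUtils}} code: the class {{InternSketch}}, the method {{internedAuthority}} and the example host name are hypothetical, and the sketch only interns a string obtained from a getter rather than touching the fields inside {{URI}}.

```java
import java.net.URI;

// Hypothetical illustration only; not the Hive StringInternUtils implementation.
class InternSketch {
    // Returns the canonical (interned) copy of the URI's authority string.
    // Equal strings always intern to the same String instance.
    static String internedAuthority(URI uri) {
        String authority = uri.getAuthority();
        return authority == null ? null : authority.intern();
    }

    public static void main(String[] args) {
        // Two partition paths on the same (made-up) NameNode.
        URI a = URI.create("hdfs://namenode:8020/warehouse/tbl/part=1");
        URI b = URI.create("hdfs://namenode:8020/warehouse/tbl/part=2");

        // After interning, both URIs can share a single authority string.
        System.out.println(internedAuthority(a) == internedAuthority(b)); // prints true
    }
}
```

With thousands of partitions on the same cluster, sharing one copy of such strings is where the saving comes from; the paths themselves differ per partition and stay unique.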
I agree that internally URIs are pretty wasteful, because they have a large
number of String fields (which indeed sometimes may be equal to each other).
Many of these fields seem to be unnecessary/redundant when {{URI}} is used in
HDFS {{Path}}. However, if you spend considerable time replacing URIs and save
just a very small amount of memory, your time may be better spent on something
else.
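
As a rough illustration of the overlap between URI components, the sketch below lists what the public getters return for a typical HDFS-style path. The class {{UriComponents}}, the method {{describe}} and the example URI are hypothetical, and the getters do not show exactly how {{java.net.URI}} stores its fields internally (some values may be computed lazily), so treat this as an indication of redundancy rather than a memory measurement.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for inspecting the string components of a URI.
class UriComponents {
    // Collects the string-valued components exposed by java.net.URI getters.
    static Map<String, String> describe(URI uri) {
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("string", uri.toString());
        parts.put("scheme", uri.getScheme());
        parts.put("schemeSpecificPart", uri.getSchemeSpecificPart());
        parts.put("authority", uri.getAuthority());
        parts.put("host", uri.getHost());
        parts.put("path", uri.getPath());
        return parts;
    }

    public static void main(String[] args) {
        // Made-up NameNode address and table path.
        URI uri = URI.create("hdfs://namenode:8020/warehouse/tbl/part=1");
        describe(uri).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Note how the path text appears in the full string, in the scheme-specific part and in the path component, which is the kind of repetition described in the issue.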
> fs.Path stores file path in java.net.URI causes big memory waste
> ----------------------------------------------------------------
>
> Key: HDFS-13752
> URL: https://issues.apache.org/jira/browse/HDFS-13752
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: fs
> Reporter: Barnabas Maidics
> Priority: Major
> Attachments: Screen Shot 2018-07-20 at 11.12.38.png
>
>
> I was looking at HiveServer2 memory usage, and a big percentage of this was
> because of org.apache.hadoop.fs.Path, where you store file paths in a
> java.net.URI object. The URI implementation stores the same string in 3
> different objects (see the attached image). In Hive, when there are many
> partitions, this causes high memory usage. In my particular case, 42% of memory
> was used by java.net.URI, so it could be reduced to 14%.
> I wonder if the community is open to replacing it with a more memory-efficient
> implementation, and what other things should be considered here? It can be a
> huge memory improvement for Hadoop and for Hive as well.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)