[ 
https://issues.apache.org/jira/browse/HDFS-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551229#comment-16551229
 ] 

Misha Dmitriev commented on HDFS-13752:
---------------------------------------

Hi [~b.maidics],

Can you share more details of your memory analysis? What version of Hive did 
you use, what kind of workload, how big is your heap, what tool for heap 
analysis? Can you share the original heap dump?

The reason I am asking is that I've analyzed a large number of HS2 and HMS heap 
dumps over the last couple of years, and made a number of memory-related 
improvements. In particular, I noticed that in some situations a lot of strings 
referenced by {{URI}}s are duplicates, so we now have the 
{{StringInternUtils.internUriStringsInPath}} method. I can imagine that even 
with string interning, {{URI}} objects may still _retain_ a lot of memory 
because of long/numerous strings. But did you check that the {{URI}} objects 
themselves really take 42% of memory? In all the realistic dumps that I've 
checked, these objects take less than 1% of memory. So 42% seems more like "the 
total memory used by URIs and all the strings that they reference", but even 
then the number looks really big.
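
To make the duplication point concrete, here is a minimal sketch, assuming nothing beyond the JDK. The class {{UriInternSketch}} and its methods are hypothetical names invented for this illustration. It only uses the public {{String.intern()}} call and does not reproduce what {{StringInternUtils.internUriStringsInPath}} does internally.

```java
import java.net.URI;

// Hypothetical illustration class; not part of Hadoop or Hive.
final class UriInternSketch {

    /** Parses a fresh copy of the text, as if each location had been read separately. */
    static URI parse(String text) {
        return URI.create(new String(text));
    }

    /** True when both URIs expose equal path strings held as two distinct String objects. */
    static boolean pathsAreDuplicated(URI a, URI b) {
        String pa = a.getPath();
        String pb = b.getPath();
        return pa.equals(pb) && pa != pb;
    }

    /** True when interning maps both path strings to one canonical String instance. */
    static boolean internedPathsAreShared(URI a, URI b) {
        return a.getPath().intern() == b.getPath().intern();
    }

    public static void main(String[] args) {
        // Made-up partition location, used only for the demonstration.
        String location = "hdfs://namenode:8020/warehouse/sales/ds=2018-07-20";
        URI first = parse(location);
        URI second = parse(location);
        System.out.println("duplicated before interning: " + pathsAreDuplicated(first, second));
        System.out.println("shared after interning: " + internedPathsAreShared(first, second));
    }
}
```

Interning removes duplicate copies of equal strings, but each {{URI}} object still keeps its own set of references, which is why the retained size can stay large even when the shallow size is small.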

I agree that internally URIs are pretty wasteful, because they have a large 
number of String fields (which indeed sometimes may be equal to each other). 
Many of these fields seem to be unnecessary/redundant when {{URI}} is used in 
HDFS {{Path}}. However, if you spend considerable time replacing URIs and save 
just a very small amount of memory, your time may be better spent on something 
else.
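
For anyone who wants to see how many String fields a {{URI}} carries on their own JDK, the following is a rough sketch using reflection. The class name {{UriFieldSketch}} is hypothetical, and the exact list of fields is an implementation detail that may differ between JDK versions, so no particular count is assumed here.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop or Hive.
final class UriFieldSketch {

    /** Returns the names of the non-static fields of the given class whose declared type is String. */
    static List<String> instanceStringFields(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            boolean isInstanceField = !Modifier.isStatic(field.getModifiers());
            if (isInstanceField && field.getType() == String.class) {
                names.add(field.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Only field metadata is read; no private values are accessed.
        List<String> names = instanceStringFields(URI.class);
        System.out.println(names.size() + " String fields in java.net.URI: " + names);
    }
}
```

This only lists the fields. How much memory they actually hold depends on which of them are populated for a given {{Path}}, which is what a heap dump would need to confirm.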

> fs.Path stores file path in java.net.URI causes big memory waste
> ----------------------------------------------------------------
>
>                 Key: HDFS-13752
>                 URL: https://issues.apache.org/jira/browse/HDFS-13752
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Barnabas Maidics
>            Priority: Major
>         Attachments: Screen Shot 2018-07-20 at 11.12.38.png
>
>
> I was looking at HiveServer2 memory usage, and a big percentage of this was 
> because of org.apache.hadoop.fs.Path, where you store file paths in a 
> java.net.URI object. The URI implementation stores the same string in 3 
> different objects (see the attached image). In Hive, when there are many 
> partitions, this causes high memory usage. In my particular case 42% of memory 
> was used by java.net.URI so it could be reduced to 14%. 
> I wonder if the community is open to replacing it with a more memory-efficient 
> implementation, and what other things should be considered here? It can be a 
> huge memory improvement for Hadoop and for Hive as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
