[ 
https://issues.apache.org/jira/browse/HDFS-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551380#comment-16551380
 ] 

Misha Dmitriev commented on HDFS-13752:
---------------------------------------

Ok, I've downloaded and analyzed the heap dump with jxray 
(http://www.jxray.com/). BTW, heap dumps usually compress quite 
well, so if you post more dumps, make sure to gzip them first. The report 
generated by jxray is attached: heapdump-100000partitions.html

According to jxray, URIs use 5.7% of the heap (see section 2 of the report). 
The ultimate source of truth would be running 'jmap -histo' on your HS2 JVM, 
since the object sizes that the JVM itself reports are the most accurate. 
Anyway, these numbers are not far off, and looking at the sample of URI 
objects, it's obvious how wasteful they are.
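
To illustrate the kind of duplication visible in the URI sample, here is a 
minimal Java sketch that uses only the public java.net.URI getters. The class 
name UriOverlap and the example location are made up for illustration, and how 
many of these strings a given JDK really keeps as separate objects is an 
implementation detail that the heap dump or 'jmap -histo' has to confirm.

```java
import java.net.URI;

public final class UriOverlap {
    /**
     * Sums the length of the full URI string plus the lengths of the
     * components that repeat parts of it (scheme, authority, host, path).
     */
    static int totalChars(URI uri) {
        int total = uri.toString().length();
        String[] parts = {
            uri.getScheme(), uri.getAuthority(), uri.getHost(), uri.getPath()
        };
        for (String part : parts) {
            if (part != null) {
                total += part.length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical partition location, not taken from the heap dump.
        URI uri = URI.create("hdfs://namenode:8020/warehouse/tbl/part=1");
        System.out.println("full string chars:      " + uri.toString().length());
        System.out.println("chars incl. components: " + totalChars(uri));
    }
}
```

The path is the longest component and appears again inside the full string, 
so with many partitions under a common prefix the repeated characters add up.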

So I personally wouldn't object to replacing URIs with a smaller, more 
specialized equivalent. I vaguely remember considering that in the past, but 
finding that URIs are used in too many places (so changing all that code would 
be a lot of work), and/or that some "naked" URIs may be passed around by the 
public APIs in HDFS or Hive. If the latter is not the case and the former is 
not a problem for you, then I guess the effort could be justified.
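
As a rough idea of what a "smaller, more specialized equivalent" could look 
like, here is a sketch under stated assumptions. CompactPath is a hypothetical 
class, not part of Hadoop or Hive, and it ignores everything fs.Path does 
beyond holding a location (normalization, parent/child handling, and so on). 
It keeps a single String and builds a java.net.URI only when a caller asks 
for one.

```java
import java.net.URI;

/** Hypothetical holder: stores one String, builds a URI only on demand. */
public final class CompactPath {
    private final String location;

    public CompactPath(String location) {
        // Parse once to reject malformed input; the parsed URI is discarded.
        // URI.create throws IllegalArgumentException on a syntax error.
        URI.create(location);
        this.location = location;
    }

    /** Re-parses on each call, trading CPU for a smaller retained footprint. */
    public URI toUri() {
        return URI.create(location);
    }

    @Override
    public String toString() {
        return location;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CompactPath
            && location.equals(((CompactPath) o).location);
    }

    @Override
    public int hashCode() {
        return location.hashCode();
    }
}
```

Whether re-parsing on demand is acceptable depends on how often callers need 
the URI form, which would have to be measured.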

> fs.Path stores file path in java.net.URI causes big memory waste
> ----------------------------------------------------------------
>
>                 Key: HDFS-13752
>                 URL: https://issues.apache.org/jira/browse/HDFS-13752
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Barnabas Maidics
>            Priority: Major
>         Attachments: Screen Shot 2018-07-20 at 11.12.38.png, 
> heapdump-100000partitions.html
>
>
> I was looking at HiveServer2 memory usage, and a big percentage of it was 
> due to org.apache.hadoop.fs.Path, which stores file paths in a 
> java.net.URI object. The URI implementation stores the same string in 3 
> different objects (see the attached image). In Hive, when there are many 
> partitions, this causes high memory usage. In my particular case 42% of memory 
> was used by java.net.URI, so it could be reduced to 14%. 
> I wonder if the community is open to replacing it with a more memory-efficient 
> implementation, and what other things should be considered here? It could be a 
> huge memory improvement for Hadoop and for Hive as well.


