[ 
https://issues.apache.org/jira/browse/HDFS-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576932#comment-16576932
 ] 

Misha Dmitriev commented on HDFS-13752:
---------------------------------------

[~b.maidics] thank you for investigating this, your document is quite helpful.

So, it looks like the old {{toUri().getPath()}} takes ~0.1 microseconds per 
call, and the new one (where a URI is re-created on demand) takes ~1 
microsecond, correct? The 10x difference is big, but the absolute numbers are 
small, and I suspect they may be negligible compared to the cost of other 
calls made during HDFS operations. Plus, these losses may be offset by 
reduced GC pauses.
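
For illustration only, here is a rough sketch of the two variants being 
compared. The class names {{CachedUriPath}}, {{LazyUriPath}} and {{PathBench}} 
are hypothetical stand-ins, not Hadoop's actual classes or the code in the 
patch, and the naive timing loop is an assumption made for brevity; a proper 
harness such as JMH would give more trustworthy numbers.

```java
import java.net.URI;
import java.util.function.Supplier;

/** Hypothetical stand-in: parses the URI once and keeps it (old behaviour). */
final class CachedUriPath {
    private final URI uri;
    CachedUriPath(String s) { this.uri = URI.create(s); }
    String getPath() { return uri.getPath(); }
}

/** Hypothetical stand-in: keeps only the string, re-creates the URI per call. */
final class LazyUriPath {
    private final String raw;
    LazyUriPath(String s) { this.raw = s; }
    URI toUri() { return URI.create(raw); }
    String getPath() { return toUri().getPath(); }
}

public class PathBench {
    // Accumulates results so the JIT cannot discard the measured calls.
    static long sink;

    /** Average nanoseconds per call, after one untimed warm-up pass. */
    static double nanosPerCall(Supplier<String> op, int iterations) {
        for (int i = 0; i < iterations; i++) {
            sink += op.get().length();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += op.get().length();
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    public static void main(String[] args) {
        // Example path only; the host and directory names are made up.
        String s = "hdfs://namenode:8020/warehouse/db/table/part=1/file_000";
        CachedUriPath cached = new CachedUriPath(s);
        LazyUriPath lazy = new LazyUriPath(s);
        int n = 1_000_000;
        System.out.printf("cached: %.1f ns/call%n", nanosPerCall(cached::getPath, n));
        System.out.printf("lazy:   %.1f ns/call%n", nanosPerCall(lazy::getPath, n));
    }
}
```

The absolute figures will vary with JVM, hardware and path length, so treat 
them as indicative of the ratio only.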

Still, I think ideally you should take an HDFS cluster, run some operations 
on it and measure how long they take, then replace the jar containing the 
code you changed with your updated version, restart the cluster, and run and 
measure the same operations again. Maybe it's enough to just list all the 
files, for example. But hopefully, now that you've spent enough time with 
this code, you can guess which operations lead to the most frequent calls to 
{{toUri()}}. I would expect these measurements to show a very small 
performance difference, but this really needs to be verified.
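
A minimal sketch of how such a before/after measurement could be timed, 
assuming the same program is run once against each jar version. The class 
{{OpTimer}} and its method are hypothetical, and the local directory walk is 
only a placeholder workload; substitute the actual HDFS operation (for 
example a recursive listing) when running this against a cluster.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.stream.Stream;

public class OpTimer {
    /**
     * Runs op `warmup` times untimed, then `runs` times timed,
     * and returns the median wall-clock time in milliseconds.
     */
    static double medianMillis(Runnable op, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            op.run();
        }
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            op.run();
            samples[i] = (System.nanoTime() - start) / 1e6;
        }
        Arrays.sort(samples);
        int mid = runs / 2;
        return runs % 2 == 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Placeholder workload: walks the local working directory.
        // Replace with the HDFS operation under test.
        Runnable listAll = () -> {
            try (Stream<Path> files = Files.walk(Paths.get("."))) {
                files.count();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        System.out.printf("median: %.3f ms%n", medianMillis(listAll, 3, 11));
    }
}
```

The median is used rather than the mean so that a single GC pause or cold 
cache does not dominate the comparison between the two runs.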

> fs.Path stores file path in java.net.URI causes big memory waste
> ----------------------------------------------------------------
>
>                 Key: HDFS-13752
>                 URL: https://issues.apache.org/jira/browse/HDFS-13752
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 2.7.6
>         Environment: Hive 2.1.1 and hadoop 2.7.6 
>            Reporter: Barnabas Maidics
>            Priority: Major
>         Attachments: Screen Shot 2018-07-20 at 11.12.38.png, 
> heapdump-100000partitions.html, measurement.pdf
>
>
> I was looking at HiveServer2 memory usage, and a big percentage of it was 
> due to org.apache.hadoop.fs.Path, which stores file paths in a 
> java.net.URI object. The URI implementation stores the same string in 3 
> different objects (see the attached image). In Hive, when there are many 
> partitions, this causes high memory usage. In my particular case, 42% of 
> memory was used by java.net.URI, and it could be reduced to 14%. 
> I wonder if the community is open to replacing it with a more 
> memory-efficient implementation, and what other things should be considered 
> here? It could be a huge memory improvement for Hadoop and for Hive as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

