[
https://issues.apache.org/jira/browse/HDFS-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590382#comment-16590382
]
Zsolt Venczel commented on HDFS-13752:
--------------------------------------
Thanks for the patch [~b.maidics] and thanks for posting the review we talked
about [~gabor.bota]!
A few additional thoughts from my side:
* The Path class is used within all services of HDFS, e.g. the DataNode and
NameNode, so the impact on these components would be tremendous. Introducing
SoftReference in a NameNode would induce some unwanted GC behavior, especially
in larger-scale clusters (the small file problem would become even more pressing).
This of course needs to be measured, so some initial metrics would be
great.
* Path.toUri() is used in Hadoop 2.7.6 in 237 places across ~20 sub-components.
In Hadoop trunk this number is much larger. Please revisit your calculations.
Thinking about the initial problem, I could imagine a solution that lives on
the client side only and introduces some caching, either by extending the Path
class or by transforming it into something more convenient.
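To make the "transform it to something more convenient" option a bit more
concrete, here is a minimal sketch of a client-side holder that keeps only the
path string and builds the java.net.URI on demand. The class name CompactPath
and its methods are hypothetical and not part of Hadoop or of the attached
patches; the sketch assumes callers can afford to re-parse the URI on each
call in exchange for the lower retained memory, which would still need to be
measured.

```java
import java.net.URI;

// Hypothetical client-side holder (not a Hadoop class): retains only the
// path string and materializes a java.net.URI when a caller asks for one.
final class CompactPath {
    private final String path;

    CompactPath(String path) {
        if (path == null) {
            throw new IllegalArgumentException("path must not be null");
        }
        this.path = path;
    }

    // Parses the string on every call; nothing URI-related is retained.
    // URI.create throws IllegalArgumentException on malformed input.
    URI toUri() {
        return URI.create(path);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CompactPath
            && path.equals(((CompactPath) other).path);
    }

    @Override
    public int hashCode() {
        return path.hashCode();
    }

    @Override
    public String toString() {
        return path;
    }
}
```

A client such as HiveServer2 could hold objects like this for its partition
locations and convert to a real Path only at the point where a FileSystem
call is made, leaving the DataNode and NameNode code paths untouched.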
> fs.Path stores file path in java.net.URI causes big memory waste
> ----------------------------------------------------------------
>
> Key: HDFS-13752
> URL: https://issues.apache.org/jira/browse/HDFS-13752
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: fs
> Affects Versions: 2.7.6
> Environment: Hive 2.1.1 and hadoop 2.7.6
> Reporter: Barnabas Maidics
> Priority: Major
> Attachments: HDFS-13752.001.patch, HDFS-13752.002.patch,
> HDFS-13752.003.patch, Screen Shot 2018-07-20 at 11.12.38.png,
> heapdump-100000partitions.html, measurement.pdf
>
>
> I was looking at HiveServer2 memory usage, and a big percentage of this was
> because of org.apache.hadoop.fs.Path, where you store file paths in a
> java.net.URI object. The URI implementation stores the same string in 3
> different objects (see the attached image). In Hive, when there are many
> partitions, this causes high memory usage. In my particular case 42% of memory
> was used by java.net.URI so it could be reduced to 14%.
> I wonder if the community is open to replacing it with a more memory efficient
> implementation and what other things should be considered here? It can be a
> huge memory improvement for Hadoop and for Hive as well.