[ 
https://issues.apache.org/jira/browse/HDFS-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590038#comment-16590038
 ] 

Gabor Bota commented on HDFS-13752:
-----------------------------------

Thanks for the v3 patch and for the measurements [~b.maidics]!
My comments on measurement.pdf:
 - Testing toUri().getPath() with both the original and the new URI was a good 
idea. Storing the Path without the URI is a clear winner in terms of memory 
consumption, but we can also see a 10x increase in call times. IMHO I would not 
use SoftReference for this purpose: its overhead would be significant, and I 
think it is better suited to cases where the held objects are somewhat bigger 
per instance.
 - I see that removing toUri when only certain URI values are needed would be 
significantly faster, but this would have to be done in all components (usages) 
that use {{hadoop-common}}/{{org.apache.hadoop.fs.Path}}, so it would be a 
bigger effort. If we make this change, we have to expect the 10x call time for 
toUri().getPath() for as long as the clients using the Path class do not change 
that behavior.
 - Your test on a cluster is a good starting point, but I still advise you to 
run a TPC-DS benchmark to better prove that we need this change, and that it 
will not cause any significant increase in {{toUri().getPath()}} call times.
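
To make the trade-off in the first two points concrete, here is a minimal sketch. It is not taken from the patch: {{LazyUriPath}}, {{pathString}} and {{getPathString()}} are hypothetical names, and the class is only an assumed stand-in for {{org.apache.hadoop.fs.Path}}. It keeps the URI components as plain strings and builds a {{java.net.URI}} only when {{toUri()}} is called, which is where the extra call time would come from.

```java
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Hypothetical stand-in for org.apache.hadoop.fs.Path (an assumption for
 * illustration, not the patch): stores only the URI components as strings
 * and builds a java.net.URI on demand.
 */
final class LazyUriPath {
    private final String scheme;     // may be null
    private final String authority;  // may be null
    private final String pathString; // hypothetical field name

    LazyUriPath(String scheme, String authority, String pathString) {
        this.scheme = scheme;
        this.authority = authority;
        this.pathString = pathString;
    }

    /** Cheap accessor: no URI object is created. */
    String getPathString() {
        return pathString;
    }

    /** Costly accessor: allocates and parses a new URI on every call. */
    URI toUri() {
        try {
            return new URI(scheme, authority, pathString, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}

class LazyUriPathDemo {
    public static void main(String[] args) {
        LazyUriPath p =
            new LazyUriPath("hdfs", "nn1:8020", "/warehouse/t/part=1");
        // Both calls yield the same path string, but the second one
        // constructs a java.net.URI first.
        System.out.println(p.getPathString());
        System.out.println(p.toUri().getPath());
    }
}
```

In a layout like this, callers that only need the path string could use the cheap accessor, while callers that keep going through {{toUri().getPath()}} pay for a URI construction each time. Whether that cost matters in practice is what the benchmark should show.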

Some first-pass review on patch v3:
 - The new {{String path}} field reuses a name that other methods already use 
for {{org.apache.hadoop.fs.Path}} instances. Please use another name to avoid 
this.
 - Please fix all checkstyle issues. If you use IntelliJ as your IDE, you can 
import the Hadoop checkstyle configuration from checkstyle/checkstyle.xml and 
run it locally on the files you modify before submitting the patch.
 - In {{Path#toUri}} you may want to use a more descriptive name than just 
{{tmp}}.
 - It would also be nice to include some tests for this change.

Thanks to [~zvenczel] for the offline brainstorm on this issue.

> fs.Path stores file path in java.net.URI causes big memory waste
> ----------------------------------------------------------------
>
>                 Key: HDFS-13752
>                 URL: https://issues.apache.org/jira/browse/HDFS-13752
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 2.7.6
>         Environment: Hive 2.1.1 and hadoop 2.7.6 
>            Reporter: Barnabas Maidics
>            Priority: Major
>         Attachments: HDFS-13752.001.patch, HDFS-13752.002.patch, 
> HDFS-13752.003.patch, Screen Shot 2018-07-20 at 11.12.38.png, 
> heapdump-100000partitions.html, measurement.pdf
>
>
> I was looking at HiveServer2 memory usage, and a big percentage of this was 
> because of org.apache.hadoop.fs.Path, where you store file paths in a 
> java.net.URI object. The URI implementation stores the same string in 3 
> different objects (see the attached image). In Hive, when there are many 
> partitions, this causes high memory usage. In my particular case 42% of memory 
> was used by java.net.URI, so it could be reduced to 14%. 
> I wonder if the community is open to replacing it with a more memory-efficient 
> implementation, and what other things should be considered here? It can be a 
> huge memory improvement for Hadoop and for Hive as well.


