[ 
https://issues.apache.org/jira/browse/HDFS-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554390#comment-16554390
 ] 

Barnabas Maidics edited comment on HDFS-13752 at 7/25/18 7:58 AM:
------------------------------------------------------------------

Looking at the implementation of fs.Path, we could replace the URI with 4 
strings: path, scheme, authority and fragment. I think these 4 strings are 
enough.

[~xiaochen]:
 You're absolutely right about the toUri method. A possible solution would be 
to store the URI in the fs.Path class as a WeakReference or SoftReference. The 
toUri method can then return the cached URI if it still exists, or create a new 
one from the strings if it does not (or if the GC has collected it). When 
memory is needed and there is no strong reference to the URI, the GC can 
collect it. This way we don't have to recreate the URI on every toUri call. 
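
To make the idea concrete, here is a minimal sketch of the four strings plus a 
softly referenced URI. CompactPath is a hypothetical stand-in, not the real 
fs.Path, and the field layout is only my assumption of how it could look:

```java
import java.lang.ref.SoftReference;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in for fs.Path; not the actual Hadoop class.
final class CompactPath {
    // The four components kept instead of a java.net.URI field.
    private final String scheme;
    private final String authority;
    private final String path;
    private final String fragment;

    // Softly reachable cache; the GC may clear it under memory pressure.
    private volatile SoftReference<URI> cachedUri;

    CompactPath(URI uri) {
        // Copy the components; the URI itself is not retained.
        this.scheme = uri.getScheme();
        this.authority = uri.getAuthority();
        this.path = uri.getPath();
        this.fragment = uri.getFragment();
    }

    // Callers that only need the path never trigger URI creation.
    String getPath() {
        return path;
    }

    URI toUri() {
        SoftReference<URI> ref = cachedUri;
        URI uri = (ref == null) ? null : ref.get();
        if (uri == null) {
            // Never built, or collected by the GC: rebuild from the strings.
            try {
                uri = new URI(scheme, authority, path, null, fragment);
            } catch (URISyntaxException e) {
                throw new IllegalStateException(e);
            }
            cachedUri = new SoftReference<>(uri);
        }
        return uri;
    }
}
```

Two threads may occasionally both rebuild the URI, but the results are equal, 
so the sketch does not lock. A WeakReference could be swapped in the same way; 
it would just be cleared more eagerly.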

This only works if the URI isn't stored in many places. So far I have checked 
many usages of the toUri method. Most of them are just transient calls (like 
String result = file.toUri().getPath()), so the URI itself is not stored.

Another thing we've discovered is that the getFileSystem() method calls 
FileSystem.get(this.toUri(), conf). This creates a Map where the key 
contains the URI, so there would be a strong reference to it. But the key 
could also be replaced with strings, which would have some CPU benefits as 
well (calling equals on a String is possibly much faster than on a URI). 
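
A rough sketch of what a string-based key could look like. StringKey is a 
hypothetical name, and keying on scheme and authority only is my assumption; 
the real cache key may need more fields:

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical map key built from plain strings instead of a URI.
final class StringKey {
    private final String scheme;
    private final String authority;

    StringKey(String scheme, String authority) {
        this.scheme = scheme;
        this.authority = authority;
    }

    // Copies the two components so the URI is not strongly referenced.
    static StringKey of(URI uri) {
        return new StringKey(uri.getScheme(), uri.getAuthority());
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof StringKey)) {
            return false;
        }
        StringKey other = (StringKey) o;
        // Null-safe string comparison; case normalization is left out here.
        return Objects.equals(scheme, other.scheme)
            && Objects.equals(authority, other.authority);
    }

    @Override
    public int hashCode() {
        return Objects.hash(scheme, authority);
    }
}
```

Two paths on the same file system then map to the same key, and a lookup in 
the Map no longer keeps any URI alive.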

About the benchmarks: do you have any benchmarks that we could use to test 
the possible effects of this change? If so, how can we run them? It would be 
good to know how much these changes would affect CPU and memory usage. 

I will continue to investigate the possible drawbacks of the change, but so 
far, apart from the (I think) small CPU overhead of the toUri method (which 
can be mitigated with a WeakReference), I haven't found any. Please let me 
know if you have any concerns that should be considered here. 


> fs.Path stores file path in java.net.URI causes big memory waste
> ----------------------------------------------------------------
>
>                 Key: HDFS-13752
>                 URL: https://issues.apache.org/jira/browse/HDFS-13752
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 2.7.6
>         Environment: Hive 2.1.1 and hadoop 2.7.6 
>            Reporter: Barnabas Maidics
>            Priority: Major
>         Attachments: Screen Shot 2018-07-20 at 11.12.38.png, 
> heapdump-100000partitions.html
>
>
> I was looking at HiveServer2 memory usage, and a large share of it was 
> due to org.apache.hadoop.fs.Path, which stores file paths in a 
> java.net.URI object. The URI implementation stores the same string in 3 
> different objects (see the attached image). In Hive, when there are many 
> partitions, this causes high memory usage. In my particular case 42% of 
> memory was used by java.net.URI, so it could be reduced to 14%. 
> I wonder if the community is open to replacing it with a more 
> memory-efficient implementation, and what other things should be considered 
> here? It could be a huge memory improvement for Hadoop and for Hive as well.


