[
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871865#action_12871865
]
Eli Collins commented on HDFS-1140:
-----------------------------------
Hey Dmytro,
Definitely an improvement. I noticed there's still a lot of copying going on:
readBytes copies the string's bytes to a byte array, then bytes2byteArray copies
this byte array into another byte array (it's hard for bytes2byteArray to use
readBytes w/o copying). Would it make sense to go whole hog and just use the
byte[] representation of a path internally? I understand that's a large change,
but it would remove a bunch of copies, and since this change is all about using
a less user-friendly abstraction in the name of reducing overhead, it might be
worth considering.
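To make the copying point concrete, here is a minimal sketch of splitting a path that is already held as UTF-8 bytes into its components, with one copy per component and no intermediate String. The class name PathBytesSketch and the method split are hypothetical, not the patch's bytes2byteArray, and the trailing-separator handling is an assumption.

```java
// Illustrative sketch only: hypothetical names, not the code in the HDFS-1140 patch.
final class PathBytesSketch {
    private static final byte SEPARATOR = (byte) '/';

    /**
     * Splits UTF-8 path bytes on '/' into components.
     * Each component is copied exactly once, straight out of the input array.
     * "/a/b" yields {"", "a", "b"}; "/" yields a single empty component.
     */
    static byte[][] split(byte[] path) {
        int len = path.length;
        // Assumption: a single trailing separator is ignored, so "/a/b/" splits like "/a/b".
        if (len > 0 && path[len - 1] == SEPARATOR) {
            len--;
        }
        // First pass: count components so the outer array is allocated once.
        int count = 1;
        for (int i = 0; i < len; i++) {
            if (path[i] == SEPARATOR) {
                count++;
            }
        }
        // Second pass: copy each component out of the input.
        byte[][] result = new byte[count][];
        int start = 0;
        int idx = 0;
        for (int i = 0; i <= len; i++) {
            if (i == len || path[i] == SEPARATOR) {
                result[idx++] = java.util.Arrays.copyOfRange(path, start, i);
                start = i + 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] raw = "/user/dmytro/file".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(split(raw).length + " components");
    }
}
```

If paths were kept as byte[] internally, a caller could pass the bytes it read from the image straight to a helper like this and skip the String round trip entirely.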
* Do we need to add the new addToParent to preserve the old String-based API?
Would be nice to have FSImage use a single representation of a path.
* bytes2byteArray could use a javadoc.
* Adding and using the following helper function as you've done with isParent
would help readability.
{{boolean isRoot(byte[][] pathComp) { return pathComp.length == 1 &&
pathComp[0].length == 0; }}}
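Spelled out with a small usage example, the helper above might look like the following sketch. The enclosing class RootCheckSketch and the sample paths are made up for illustration; only the body of isRoot comes from the suggestion above.

```java
// Illustrative sketch only: the wrapper class and sample data are hypothetical.
final class RootCheckSketch {
    /**
     * Returns true if the components describe the root path,
     * i.e. exactly one component and that component is empty.
     */
    static boolean isRoot(byte[][] pathComp) {
        return pathComp.length == 1 && pathComp[0].length == 0;
    }

    public static void main(String[] args) {
        // "/" represented as a single empty component.
        byte[][] root = { new byte[0] };
        // "/user" represented as an empty component followed by "user".
        byte[][] user = {
            new byte[0],
            "user".getBytes(java.nio.charset.StandardCharsets.UTF_8)
        };
        System.out.println(isRoot(root)); // prints "true"
        System.out.println(isRoot(user)); // prints "false"
    }
}
```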
> Speedup INode.getPathComponents
> -------------------------------
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Dmytro Molkov
> Assignee: Dmytro Molkov
> Attachments: HDFS-1140.2.patch, HDFS-1140.patch
>
>
> When the namenode is loading the image, a significant amount of time is
> spent in DFSUtil.string2Bytes. We have a very specific workload
> here: the path that the namenode calls getPathComponents for shares N - 1
> components with the previous path this method was called for (assuming the
> current path has N components).
> Hence we can improve the image load time by caching the result of the previous
> conversion.
> We thought of using some simple LRU cache for components, but in practice
> String.getBytes gets optimized at runtime and an LRU cache doesn't perform
> as well. However, caching just the latest path components and their translation
> to bytes in two arrays gives quite a performance boost.
> I could get another 20% off of the time to load the image on our cluster (30
> seconds vs 24), and I wrote a simple benchmark that tests performance with and
> without caching.
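The "latest path in two arrays" idea from the description can be sketched roughly as follows. This is not the attached patch: the class LastPathCacheSketch, the conversions counter, and the use of String.split are assumptions made for illustration, and the sketch is not thread-safe.

```java
// Illustrative sketch only: hypothetical names, not the code in the HDFS-1140 patch.
final class LastPathCacheSketch {
    // Components of the most recently converted path, and their byte translations.
    private String[] lastStrings = new String[0];
    private byte[][] lastBytes = new byte[0][];

    // Counts String-to-bytes conversions; present only so the demo can show the saving.
    int conversions = 0;

    byte[][] getPathComponents(String path) {
        String[] parts = path.split("/");
        if (parts.length == 0) {
            // "/" splits to an empty array; represent root as one empty component.
            parts = new String[] { "" };
        }
        byte[][] bytes = new byte[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            if (i < lastStrings.length && parts[i].equals(lastStrings[i])) {
                // Same component at the same position as last time: reuse its bytes.
                bytes[i] = lastBytes[i];
            } else {
                bytes[i] = parts[i].getBytes(java.nio.charset.StandardCharsets.UTF_8);
                conversions++;
            }
        }
        lastStrings = parts;
        lastBytes = bytes;
        return bytes;
    }

    public static void main(String[] args) {
        LastPathCacheSketch cache = new LastPathCacheSketch();
        cache.getPathComponents("/user/dmytro/a");
        cache.getPathComponents("/user/dmytro/b");
        System.out.println("conversions: " + cache.conversions);
    }
}
```

Note that reused components are shared by reference between successive results, so callers of a cache like this must treat the returned arrays as read-only.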