[
https://issues.apache.org/jira/browse/SENTRY-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084686#comment-16084686
]
Hadoop QA commented on SENTRY-1827:
-----------------------------------
Here are the results of testing the latest attachment
https://issues.apache.org/jira/secure/attachment/12876949/SENTRY-1827.03-sentry-ha-redesign.patch
against sentry-ha-redesign.
{color:red}Overall:{color} -1 due to an error
{color:red}ERROR:{color} failed to apply patch (exit code 1):
The patch does not appear to apply with p0, p1, or p2
Console output:
https://builds.apache.org/job/PreCommit-SENTRY-Build/3022/console
This message is automatically generated.
> Minimize TPathsDump thrift message used in HDFS sync
> ----------------------------------------------------
>
> Key: SENTRY-1827
> URL: https://issues.apache.org/jira/browse/SENTRY-1827
> Project: Sentry
> Issue Type: Improvement
> Affects Versions: 1.8.0, sentry-ha-redesign
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Fix For: 1.8.0
>
> Attachments: SENTRY-1827.01.patch,
> SENTRY-1827.01-sentry-ha-redesign.patch, SENTRY-1827.02.patch,
> SENTRY-1827.02-sentry-ha-redesign.patch, SENTRY-1827.03.patch,
> SENTRY-1827.03-sentry-ha-redesign.patch, SENTRY-1827.04.patch
>
>
> We obtained a heap dump taken from the JVM running Hive Metastore at the time
> when Sentry HDFS sync operation was performed. I've analyzed this dump with
> jxray (www.jxray.com) and found that a significant percentage of memory is
> wasted due to duplicate strings:
> {code}
> 7. DUPLICATE STRINGS
> Total strings: 29,986,017 Unique strings: 9,640,413 Duplicate values:
> 4,897,743 Overhead: 2,570,746K (9.4%)
> {code}
> Of them, more than 1/3 come from sentry:
> {code}
> 917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing
> arrays:
> <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--
> {j.u.HashMap}.values <--
> org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <--
> org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java
> Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
> {code}
> The duplicate strings in memory have been eliminated by SENTRY-1811. However,
> when these strings are serialized into the TPathsDump thrift message, they
> are duplicated again. That is, if there are 3 different TPathEntry objects
> with the same pathElement="foo", then (even if there is only one interned
> copy of the "foo" string in memory), a separate copy of "foo" will be written
> to the serialized message for each of these 3 TPathEntries. This is one
> reason why serialized TPathsDump messages may get very big, consume a lot of
> memory, and take a long time to send over the network.
> To address this problem, we may use a form of custom compression where,
> instead of writing multiple copies of duplicate strings, we substitute
> shorter "string ids" for the repeated occurrences.
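The string-id scheme described above can be sketched as follows. This is a hypothetical illustration only, not the actual SENTRY-1827 patch: the class and method names (`StringPoolSketch`, `intern`, `lookup`) are invented for this example. The core idea is that each distinct string is stored once in a table, and every repeated occurrence is encoded as a small integer id referencing that table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a deduplicating string table for serialization.
// Each distinct string gets one table entry; repeats become integer ids.
public class StringPoolSketch {
    private final Map<String, Integer> idByString = new HashMap<>();
    private final List<String> stringById = new ArrayList<>();

    // Return the id for s, assigning a new id the first time s is seen.
    public int intern(String s) {
        Integer id = idByString.get(s);
        if (id == null) {
            id = stringById.size();
            idByString.put(s, id);
            stringById.add(s);
        }
        return id;
    }

    // Resolve an id back to its string on the deserialization side.
    public String lookup(int id) {
        return stringById.get(id);
    }

    // Number of distinct strings actually stored.
    public int uniqueCount() {
        return stringById.size();
    }

    public static void main(String[] args) {
        StringPoolSketch pool = new StringPoolSketch();
        // Three TPathEntry-like records share pathElement "foo"; only the
        // ids (0, 1, 0, 0) plus the two-entry table need to be serialized.
        int[] ids = {
            pool.intern("foo"), pool.intern("bar"),
            pool.intern("foo"), pool.intern("foo")
        };
        System.out.println(pool.uniqueCount());        // distinct strings stored
        System.out.println(pool.lookup(ids[2]));       // resolves back to "foo"
    }
}
```

In a serialized message, the writer would emit the id stream followed by the unique-string table, so "foo" crosses the wire once regardless of how many TPathEntry records reference it.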
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)