[
https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532187#comment-16532187
]
Misha Dmitriev commented on HIVE-19937:
---------------------------------------
Thank you for sharing the jxray report, [~stakiar].
If it reflects the situation in real-life applications accurately enough, then
it looks like the main sources of duplicate strings are not so much the
{{JobConf}} tables as several other things, which you can see by expanding
"Expensive fields" and "Full reference chains" in section 7:
# Most of the duplicate strings (~9% out of 13.5% total) come from data fields
of {{java.net.URI}}. All these URIs, in turn, come from
{{org.apache.hadoop.fs.Path.uri}}. {{Path}}s come from more than one source,
but the biggest one is this reference chain:
{code:java}
↖java.net.URI.schemeSpecificPart
↖org.apache.hadoop.fs.Path.uri
↖{j.u.LinkedHashMap}.keys
↖org.apache.hadoop.hive.ql.plan.MapWork.pathToAliases{code}
It turns out that in the past I have already taken care of interning strings in
such URIs, see e.g. this method in MapWork.java:
{code:java}
public void setPathToAliases(final LinkedHashMap<Path, ArrayList<String>> pathToAliases) {
  for (Path p : pathToAliases.keySet()) {
    StringInternUtils.internUriStringsInPath(p);
  }
  this.pathToAliases = pathToAliases;
}{code}
However, there are other methods that can also add {{Path}}s to
{{pathToAliases}}: two flavors of {{addPathToAlias()}} and maybe something
else. I think we need to modify all these methods so that they also call
{{StringInternUtils.internUriStringsInPath()}} for the {{Path}}s that are passed
to them. This will remove the said 9% of duplicate strings.
# One other source of duplicate strings in URIs referenced by {{Path}}s is the
map in {{ProjectionPusher.pathToPartitionInfo}}. I think this would be fixed if
in the following line in this class
{code:java}
pathToPartitionInfo.put(Path.getPathWithoutSchemeAndAuthority(entry.getKey()),
...{code}
you insert the {{StringInternUtils.internUriStringsInPath()}} call.
# The very first line in the "Full reference chains" says that 2% of memory is
wasted by duplicate strings that are values in {{CopyOnFirstWriteProperties}}
tables that are reachable via this reference chain:
{code:java}
org.apache.hadoop.hive.common.CopyOnFirstWriteProperties.table
↖org.apache.hadoop.hive.ql.plan.PartitionDesc.properties
↖{j.u.LinkedHashMap}.values
↖org.apache.hadoop.hive.ql.plan.MapWork.pathToPartitionInfo{code}
This is a bit unexpected, given that, as you noticed before, we already take
care of interning this table's values in {{PartitionDesc#internProperties}}.
Probably some uninterned string values are added to this table later, possibly
by code that obtains the table by calling {{getProperties()}}. I hope that with
your knowledge of Hive code you will manage to determine the culprit here. One
more clue is the contents of the duplicate strings coming from these tables,
e.g.
||Num strings||String value||
|36|"hdfs://vc0501.halxg.cloudera.com:8020/user/systest/tpcds_1000_decimal_parquet/store_sales/ss_sold_date_sk=2452497"|
|36|"hdfs://vc0501.halxg.cloudera.com:8020/user/systest/tpcds_1000_decimal_parquet/store_sales/ss_sold_date_sk=2452422"|
# There are several other sources of duplicate strings that jxray reports.
They cause much less overhead, but some may still be worth fixing. Let me know
if you need help with them. Interestingly, as far as I can see, strings coming
from {{JobConf}} waste just about 0.2% of memory.
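To make point 1 concrete, here is a minimal, self-contained sketch of the idea that every method adding an entry must intern, not only the bulk setter. It is an illustration under stated assumptions, not the actual {{MapWork}} code: plain {{String}} keys stand in for {{Path}}, {{String.intern()}} stands in for {{StringInternUtils.internUriStringsInPath()}}, and the class name {{PathAliasRegistry}} and its {{storedKey()}} helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for MapWork: String keys replace Path, and
// String.intern() replaces StringInternUtils.internUriStringsInPath().
final class PathAliasRegistry {
    private Map<String, List<String>> pathToAliases = new LinkedHashMap<>();

    // Bulk setter: copies the entries so that every key is the canonical instance.
    void setPathToAliases(Map<String, List<String>> source) {
        Map<String, List<String>> interned = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : source.entrySet()) {
            interned.put(e.getKey().intern(), e.getValue());
        }
        this.pathToAliases = interned;
    }

    // Incremental mutator: it has to intern too, otherwise duplicate
    // strings still reach the map through this path.
    void addPathToAlias(String path, String alias) {
        pathToAliases.computeIfAbsent(path.intern(), k -> new ArrayList<>()).add(alias);
    }

    // Returns the key instance actually stored for an equal path, or null.
    String storedKey(String path) {
        for (String k : pathToAliases.keySet()) {
            if (k.equals(path)) {
                return k;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PathAliasRegistry a = new PathAliasRegistry();
        PathAliasRegistry b = new PathAliasRegistry();
        // new String(...) forces two distinct heap copies of the same text.
        a.addPathToAlias(new String("hdfs://host:8020/t/p=1"), "t1");
        b.addPathToAlias(new String("hdfs://host:8020/t/p=1"), "t2");
        // Reference comparison: checks whether both registries share one instance.
        System.out.println(
            a.storedKey("hdfs://host:8020/t/p=1") == b.storedKey("hdfs://host:8020/t/p=1"));
    }
}
```

In the real code the interning would be done on the {{Path}} before it is put into the map, as {{setPathToAliases()}} already does.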
Also, as far as I can see in section 2, {{java.util.Properties}} objects
together consume 8.5% of memory, which is significant. But most of that comes
from {{TableDesc#properties}}. {{JobConf#properties}} uses just 0.8% of memory,
so it is probably not worth optimizing.
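Coming back to the properties tables in point 3: one way to stop uninterned values from being added after {{internProperties}} has run would be to intern on every write. The sketch below is only an illustration of that idea. {{InterningProperties}} is a hypothetical class, not Hive's {{CopyOnFirstWriteProperties}}, and it covers only {{put()}} and {{setProperty()}}; bulk paths such as {{putAll()}} or {{load()}} are not handled here.

```java
import java.util.Properties;

// Hypothetical sketch, not Hive's CopyOnFirstWriteProperties: a Properties
// subclass that interns String keys and values on single-entry writes.
final class InterningProperties extends Properties {
    private static final long serialVersionUID = 1L;

    @Override
    public synchronized Object put(Object key, Object value) {
        return super.put(internIfString(key), internIfString(value));
    }

    // Overridden explicitly so the sketch does not depend on how the JDK
    // implements setProperty() internally.
    @Override
    public synchronized Object setProperty(String key, String value) {
        return put(key, value);
    }

    private static Object internIfString(Object o) {
        return (o instanceof String) ? ((String) o).intern() : o;
    }

    public static void main(String[] args) {
        InterningProperties p = new InterningProperties();
        InterningProperties q = new InterningProperties();
        // Two distinct heap copies of the same value, as a caller might produce.
        p.setProperty("location", new String("hdfs://host:8020/t/p=1"));
        q.setProperty("location", new String("hdfs://host:8020/t/p=1"));
        // Reference comparison: checks whether both tables hold the same instance.
        System.out.println(p.getProperty("location") == q.getProperty("location"));
    }
}
```

Whether interning on write is cheaper than finding and fixing the individual callers of {{getProperties()}} depends on how often these tables are modified, which the report alone does not tell us.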
> Intern JobConf objects in Spark tasks
> -------------------------------------
>
> Key: HIVE-19937
> URL: https://issues.apache.org/jira/browse/HIVE-19937
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: HIVE-19937.1.patch, report.html
>
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the
> {{JobConf}} object to prevent any {{ConcurrentModificationException}} from
> being thrown. However, setting this variable comes at a cost of storing a
> duplicate {{JobConf}} object for each Spark task. These objects can take up a
> significant amount of memory, we should intern them so that Spark tasks
> running in the same JVM don't store duplicate copies.