[
https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537341#comment-16537341
]
Sahil Takiar commented on HIVE-19937:
-------------------------------------
Attached an updated patch and re-purposed this JIRA based on further analysis.
Some key points I considered:
* Lots of interning is already being done in {{MapWork}} and associated classes
like {{PartitionDesc}}
** The issue is that this interning only takes effect within HS2, not
inside Spark executors
** The problem is that when Kryo de-serializes {{MapWork}} objects, it does so
in a way that avoids any calls to the intern methods
* The proposed solution in this patch is to use Kryo's {{BeanSerializer}}
instead of the default {{FieldSerializer}}
** The advantage is that {{BeanSerializer}} will use the setter methods
inside {{MapWork}} during de-serialization, which should trigger the intern
methods; the drawback is that {{BeanSerializer}} is slightly slower than
{{FieldSerializer}}
This eliminates a lot of the string duplication, but overhead due to duplicate
{{Properties}} objects is still an issue.
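To illustrate why the choice of serializer matters, here is a minimal, self-contained sketch (plain Java, no Kryo dependency; {{PartitionDescLike}} and the table name are invented for the demo, not Hive classes). Populating a field reflectively, the way {{FieldSerializer}} does, bypasses an interning setter; populating through the setter, the way {{BeanSerializer}} does, triggers it:

```java
import java.lang.reflect.Field;

// Toy stand-in for a class like PartitionDesc whose setter interns its argument
public class PartitionDescLike {
    private String tableName;

    public void setTableName(String name) {
        this.tableName = (name == null) ? null : name.intern();
    }

    public String getTableName() { return tableName; }

    public static void main(String[] args) throws Exception {
        String canonical = "default.tbl";         // interned string literal
        String duplicate = new String(canonical); // distinct heap copy, equal contents

        // BeanSerializer-style population: goes through the setter, so interning runs
        PartitionDescLike viaSetter = new PartitionDescLike();
        viaSetter.setTableName(duplicate);
        System.out.println(viaSetter.getTableName() == canonical);  // true

        // FieldSerializer-style population: writes the field reflectively,
        // bypassing the setter, so the duplicate copy survives
        PartitionDescLike viaField = new PartitionDescLike();
        Field f = PartitionDescLike.class.getDeclaredField("tableName");
        f.setAccessible(true);
        f.set(viaField, duplicate);
        System.out.println(viaField.getTableName() == canonical);   // false
    }
}
```

In Kryo terms, the approach in the patch amounts to registering {{MapWork}} with something like {{kryo.register(MapWork.class, new BeanSerializer<>(kryo, MapWork.class))}} rather than relying on the default {{FieldSerializer}}.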
Another issue I encountered is with {{CopyOnFirstWriteProperties}}.
* While this class works well for HS2, I think it needs to be modified if we
want to mimic its behavior inside Spark executors
* The main issue is that the {{CopyOnFirstWritePropertiesSerializer}}
serializes the interned {{Properties}} object along with each
{{CopyOnFirstWriteProperties}} object, so during de-serialization each
{{CopyOnFirstWriteProperties}} ends up backed by its own separate
{{Properties}} object, and we lose most of the benefits of this class
* I haven't fully figured out the best way to address this issue, but I think
it's complex enough to warrant a separate JIRA
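For reference, the sharing that de-serialization currently breaks can be sketched as follows (a minimal, hypothetical {{CowProperties}}; the interner map and property names are invented for the example and are not Hive's actual implementation). Many instances share one canonical {{Properties}} object until the first write, at which point only the writer takes a private copy:

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Minimal copy-on-first-write sketch: instances share one canonical
// Properties object until the first mutation, which creates a private copy.
class CowProperties {
    // Hypothetical interner: maps equal Properties to one canonical instance
    static final ConcurrentHashMap<Properties, Properties> INTERNER =
        new ConcurrentHashMap<>();

    static Properties intern(Properties p) {
        Properties existing = INTERNER.putIfAbsent(p, p);
        return existing != null ? existing : p;
    }

    private final Properties shared; // canonical, shared across instances
    private Properties own;          // private copy, created on first write

    CowProperties(Properties p) {
        this.shared = intern(p);
    }

    String get(String key) {
        return (own != null ? own : shared).getProperty(key);
    }

    void set(String key, String value) {
        if (own == null) {           // first write: copy, then mutate the copy
            own = new Properties();
            own.putAll(shared);
        }
        own.setProperty(key, value);
    }

    boolean sharesBackingWith(CowProperties other) {
        return own == null && other.own == null && shared == other.shared;
    }
}
```

If each de-serialized {{CopyOnFirstWriteProperties}} arrives in the executor with its own copy of the backing {{Properties}}, the instances never share a backing object in the first place and the memory savings disappear, which is the problem described above.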
> Use BeanSerializer for MapWork to carry calls to String.intern
> --------------------------------------------------------------
>
> Key: HIVE-19937
> URL: https://issues.apache.org/jira/browse/HIVE-19937
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: HIVE-19937.1.patch, HIVE-19937.2.patch, report.html
>
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the
> {{JobConf}} object to prevent any {{ConcurrentModificationException}} from
> being thrown. However, setting this variable comes at a cost of storing a
> duplicate {{JobConf}} object for each Spark task. These objects can take up a
> significant amount of memory; we should intern them so that Spark tasks
> running in the same JVM don't store duplicate copies.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)