[ 
https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537341#comment-16537341
 ] 

Sahil Takiar commented on HIVE-19937:
-------------------------------------

Attached an updated patch and re-purposed this JIRA based on further analysis. 
Some key points I considered:
* Lots of interning is already being done in {{MapWork}} and associated classes 
like {{PartitionDesc}}
** The issue is that this interning only takes effect within HS2, but not 
inside Spark executors
** The problem is that when Kryo de-serializes {{MapWork}} objects, it does so 
in a way that avoids any calls to the intern methods
* The proposed solution in this patch is to use Kryo's {{BeanSerializer}} 
instead of the default {{FieldSerializer}}
** The advantage is that {{BeanSerializer}} will use the setter methods 
inside {{MapWork}} during de-serialization, which should trigger the intern 
calls; the drawback is that {{BeanSerializer}} is slightly slower than 
{{FieldSerializer}}
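
A minimal, self-contained sketch of why the serializer choice matters. {{MiniDesc}} is a hypothetical stand-in for {{MapWork}}/{{PartitionDesc}} (not a Hive class): a reflective field write models {{FieldSerializer}}'s behavior, while calling the setter models {{BeanSerializer}}'s, and only the latter triggers the {{String.intern()}} call:

```java
import java.lang.reflect.Field;

public class InternDemo {
    /** Hypothetical stand-in for MapWork/PartitionDesc; its setter
     *  interns, mirroring the intern calls inside Hive's setters. */
    static class MiniDesc {
        private String tableName;
        public void setTableName(String t) { tableName = (t == null) ? null : t.intern(); }
        public String getTableName() { return tableName; }
    }

    // FieldSerializer-style: a reflective field write bypasses the setter,
    // so the intern call never runs and the two copies stay distinct.
    static boolean sharesViaFieldWrite() {
        try {
            MiniDesc d1 = new MiniDesc(), d2 = new MiniDesc();
            Field f = MiniDesc.class.getDeclaredField("tableName");
            f.setAccessible(true);
            f.set(d1, new String("db.table"));
            f.set(d2, new String("db.table"));
            return d1.getTableName() == d2.getTableName();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // BeanSerializer-style: the setter runs and String.intern() collapses
    // equal strings to one canonical instance per JVM.
    static boolean sharesViaSetter() {
        MiniDesc d1 = new MiniDesc(), d2 = new MiniDesc();
        d1.setTableName(new String("db.table"));
        d2.setTableName(new String("db.table"));
        return d1.getTableName() == d2.getTableName();
    }

    public static void main(String[] args) {
        System.out.println("field write shares storage:  " + sharesViaFieldWrite());  // false
        System.out.println("setter write shares storage: " + sharesViaSetter());      // true
    }
}
```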

This eliminates a lot of the string duplication, but overhead due to duplicate 
{{Properties}} objects is still an issue.

Another issue I encountered is with {{CopyOnFirstWriteProperties}}.
* While this class works well for HS2, I think it needs to be modified if we 
want to mimic its behavior inside Spark executors
* The main issue is that the {{CopyOnFirstWritePropertiesSerializer}} 
serializes the interned {{Properties}} object for each 
{{CopyOnFirstWriteProperties}} object; during de-serialization, a separate 
{{Properties}} object then backs each {{CopyOnFirstWriteProperties}}, so 
we lose most of the benefits of this class
* I haven't fully figured out the best way to address this issue, but I think 
it's complex enough to warrant a separate JIRA
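
One possible direction (purely a sketch; {{PropsInterner}} and its {{canonicalize}} method are assumptions for illustration, not Hive APIs): a de-serialization-side pool that collapses value-equal {{Properties}} objects to one canonical instance per JVM, so many {{CopyOnFirstWriteProperties}} could again share a single backing object. This relies on the backing {{Properties}} never being mutated after interning, which the copy-on-write semantics should guarantee:

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class PropsInterner {
    // Pool keyed by value equality (Properties inherits Hashtable's
    // value-based equals/hashCode). Safe only if pooled instances are
    // never mutated afterwards, which copy-on-first-write provides.
    private static final Map<Properties, Properties> POOL = new ConcurrentHashMap<>();

    /** Return the canonical instance for a value-equal Properties, so
     *  deserialized duplicates collapse to one shared object per JVM. */
    public static Properties canonicalize(Properties p) {
        Properties prev = POOL.putIfAbsent(p, p);
        return prev != null ? prev : p;
    }
}
```

A custom serializer could call {{canonicalize}} on the read path; the open question noted above is how to integrate this cleanly with Kryo inside Spark executors.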

> Use BeanSerializer for MapWork to carry calls to String.intern
> --------------------------------------------------------------
>
>                 Key: HIVE-19937
>                 URL: https://issues.apache.org/jira/browse/HIVE-19937
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-19937.1.patch, HIVE-19937.2.patch, report.html
>
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the 
> {{JobConf}} object to prevent any {{ConcurrentModificationException}} from 
> being thrown. However, this comes at the cost of storing a 
> duplicate {{JobConf}} object for each Spark task. These objects can take up a 
> significant amount of memory; we should intern them so that Spark tasks 
> running in the same JVM don't store duplicate copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)