[
https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Misha Dmitriev updated HIVE-19668:
----------------------------------
Description:
I've recently analyzed an HS2 heap dump obtained when there was a huge memory
spike during compilation of some big query. The analysis was done with jxray
([www.jxray.com|http://www.jxray.com]). It turns out that more than 90% of
the 20G heap was used by data structures associated with query parsing
({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple
opportunities for optimization here. One of them is to stop the code from
creating duplicate instances of the {{org.antlr.runtime.CommonToken}} class.
See a sample of these objects in the attached image:
!image-2018-05-22-17-41-39-572.png|width=879,height=399!
It looks like these particular {{CommonToken}} objects are constants that don't
change once created. I see some code, e.g. in
{{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are
apparently created repeatedly with calls like {{new
CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}}. If these 33 token kinds are
instead created once and reused, we will save more than 1/10th of the heap in
this scenario. Plus, since these objects are small but very numerous, getting
rid of them will remove a great deal of pressure from the GC.
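For illustration, here is a minimal sketch of the idea. The {{TokenConstants}}
holder class and the particular constants shown are hypothetical, and sharing a
single instance per token kind is only safe for tokens that are provably never
mutated after creation:
{code:java}
import org.antlr.runtime.CommonToken;
import org.apache.hadoop.hive.ql.parse.HiveParser;

// Hypothetical holder class: one shared, effectively-immutable CommonToken
// per token kind, instead of a fresh allocation at every call site.
public final class TokenConstants {
  public static final CommonToken TOK_INSERT =
      new CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT");
  public static final CommonToken TOK_QUERY =
      new CommonToken(HiveParser.TOK_QUERY, "TOK_QUERY");
  // ... one constant for each of the ~33 token kinds created this way.
  // NOTE: safe only because these tokens are never mutated after creation.

  private TokenConstants() {}
}
{code}
Call sites would then reference {{TokenConstants.TOK_INSERT}} instead of
allocating a fresh token each time.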
Another source of waste is duplicate strings, which collectively waste 26.1% of
the memory. Some of them come from {{CommonToken}} objects that have the same
text (i.e. for multiple {{CommonToken}} objects the contents of their 'text'
Strings are the same, but each has its own copy of that String). Other
duplicate strings come from sources that are easy enough to fix by adding
{{String.intern()}} calls.
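A minimal sketch of the interning fix for token text, using the standard
{{getText()}}/{{setText()}} accessors on {{org.antlr.runtime.CommonToken}}
(the helper class is hypothetical, and where exactly to hook it into the parse
path is still to be determined):
{code:java}
import org.antlr.runtime.CommonToken;

// Hypothetical helper: deduplicate a token's text via String.intern(), so
// that all tokens with equal text share one canonical String instance.
public final class TokenTextInterner {
  public static void internText(CommonToken token) {
    String text = token.getText();
    if (text != null) {
      token.setText(text.intern());
    }
  }

  private TokenTextInterner() {}
}
{code}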
was:
I've recently analyzed an HS2 heap dump obtained when there was a huge memory
spike during compilation of some big query. The analysis was done with jxray
([www.jxray.com|http://www.jxray.com]). It turns out that more than 90% of
the 20G heap was used by data structures associated with query parsing
({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple
opportunities for optimization here. One of them is to stop the code from
creating duplicate instances of the {{org.antlr.runtime.CommonToken}} class.
See a sample of these objects in the attached image:
!image-2018-05-22-17-41-39-572.png|width=879,height=399!
It looks like these particular {{CommonToken}} objects are constants that don't
change once created. I see some code, e.g. in
{{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are
apparently created repeatedly with calls like {{new
CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}}. If these 33 token kinds are
instead created once and reused, we will save more than 1/10th of the heap in
this scenario. Plus, since these objects are small but very numerous, getting
rid of them will remove a great deal of pressure from the GC.
> 11.8% of the heap wasted due to duplicate org.antlr.runtime.CommonToken's
> -------------------------------------------------------------------------
>
> Key: HIVE-19668
> URL: https://issues.apache.org/jira/browse/HIVE-19668
> Project: Hive
> Issue Type: Improvement
> Components: HiveServer2
> Affects Versions: 3.0.0
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Priority: Major
> Attachments: image-2018-05-22-17-41-39-572.png
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)