[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881399#comment-15881399
 ] 

Misha Dmitriev commented on HIVE-15882:
---------------------------------------

I've just measured the CPU performance impact of my changes using the same 
benchmark with the same high heap size (-Xmx3g) to exclude effects of excessive 
GC. I've measured the total time spent in all beeline clients. To do that, I 
ran each beeline client under /usr/bin/time as follows:

{code}
for i in `seq 1 50`; do
  /usr/bin/time -p -o hive-timings-withchanges.txt --append \
    beeline -u jdbc:hive2://localhost:10000 -n admin -p admin \
    -e "select count(i_f_1) from misha_table;" &
done
{code}

I then summed the "real" (wall-clock) timings in the file, truncated to whole 
seconds, with another small bash script:

{code}
sum=0; for s in `grep real hive-timings-withchanges.txt`; do t=${s/real/}; 
t=${t/\.*/}; echo $t; sum=$((sum+t)); done; echo $sum
{code}
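
If fractional seconds matter, the same total can be computed without 
truncation. The following is only a minimal sketch, assuming the file holds the 
usual real/user/sys lines written by /usr/bin/time -p. The function name 
sum_real_times is hypothetical, and this is not the script that produced the 
figures reported in this comment:

```shell
# Sum the "real" lines written by /usr/bin/time -p, keeping fractional seconds.
# Lines whose first field is not "real" (user, sys, error notes) are ignored.
sum_real_times() {
  awk '$1 == "real" { sum += $2 } END { printf "%.2f\n", sum }' "$1"
}

# Example usage with the timings file from the benchmark:
# sum_real_times hive-timings-withchanges.txt
```

Because this keeps the fractional part of each timing, its total will be 
somewhat higher than the one produced by the truncating loop.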

The result is:
- before my changes: 17401s
- after my changes: 17012s

So my changes have no negative CPU impact, and may even yield an improvement 
of about 2% (389s out of 17401s). This is not surprising given that my changes 
reduce the number of objects in memory, and thus ultimately reduce GC time.
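
For reference, the relative difference between the two totals can be checked 
with a short calculation. This is a sketch only; the helper name 
percent_change is hypothetical:

```shell
# Relative change between two totals, as a percentage of the "before" value.
percent_change() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f%%\n", (before - after) / before * 100 }'
}

percent_change 17401 17012   # prints 2.2%
```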

Do I really need another JIRA ticket to post a patch that covers my other 
change (interning Properties objects in PartitionDesc)?

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-15882
>                 URL: https://issues.apache.org/jira/browse/HIVE-15882
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:10000 -n admin \
>   -p admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in an HS2 
> server running with -Xmx200m, and with 50 queries in one running with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be greatly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or hold just one element. By modifying the code that creates and 
> populates these collections, we can likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. These 
> objects appear to be heavily duplicated, since for each partition each 
> concurrently running query creates its own copy of Partition, PartitionDesc and 
> Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
> Properties objects in memory. By interning/deduplicating these objects we may be 
> able to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
