[jira] [Commented] (HIVE-12683) Does Tez run slower than hive on larger dataset (~2.5 TB)?

rohit garg (JIRA) Tue, 15 Dec 2015 14:34:03 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058999#comment-15058999
 ]


rohit garg commented on HIVE-12683:
-----------------------------------

I tried a lot of different settings. This was one of the ones that worked. It 
came from the formulas below. I am not sure that is the right way to estimate it.

We used the following formulae to guide us in determining YARN and MapReduce 
memory configurations:

Number of containers = min(2 * cores, 1.8 * disks, (Total available RAM) / 
min_container_size)
Reserved Memory = Memory for stack memory
Total available RAM = Total RAM of the cluster – Reserved Memory
Disks = Number of data disks per machine
min_container_size = Minimum container size (in RAM). Its value depends on the 
RAM available
RAM-per-container = max(min_container_size, (Total Available RAM) / containers)

For example, for our cluster, we had 32 CPU cores, 244 GB RAM, and 2 disks per 
node.

Reserved Memory = 38 GB
min_container_size = 2 GB
Total available RAM = (244 - 38) GB = 206 GB
Number of containers = min(2 * 32, 1.8 * 2, 206 / 2) = min(64, 3.6, 103) = ~4
RAM-per-container = max(2, 206 / 4) = max(2, 51.5) = ~52 GB
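
The arithmetic above can be reproduced with a short Python sketch. It is only 
an illustration of the formulas as written in this comment: the function name 
and parameter names are hypothetical, and rounding 3.6 to the nearest whole 
container is an assumption taken from the "~4" in the example.

```python
def estimate_containers(cores, disks, total_ram_gb, reserved_gb, min_container_gb):
    """Apply the container-sizing formulas from this comment to one node."""
    # Total available RAM = Total RAM - Reserved Memory
    available_gb = total_ram_gb - reserved_gb
    # Number of containers = min(2 * cores, 1.8 * disks, available / min_container_size)
    raw = min(2 * cores, 1.8 * disks, available_gb / min_container_gb)
    # Assumption: round to the nearest whole container (3.6 -> 4), never below 1.
    containers = max(1, round(raw))
    # RAM-per-container = max(min_container_size, available / containers)
    ram_per_container_gb = max(min_container_gb, available_gb / containers)
    return available_gb, containers, ram_per_container_gb


# Figures from the example: 32 cores, 2 disks, 244 GB RAM, 38 GB reserved,
# 2 GB minimum container size.
available_gb, containers, ram_per_container_gb = estimate_containers(
    cores=32, disks=2, total_ram_gb=244, reserved_gb=38, min_container_gb=2
)
print(available_gb, containers, ram_per_container_gb)
```

With only 2 disks, the 1.8 * disks term is the smallest of the three, so it 
alone decides the container count in this example.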



> Does Tez run slower than hive on larger dataset (~2.5 TB)?
> ----------------------------------------------------------
>
>                 Key: HIVE-12683
>                 URL: https://issues.apache.org/jira/browse/HIVE-12683
>             Project: Hive
>          Issue Type: Bug
>            Reporter: rohit garg
>
> We have started testing the Tez query engine. In initial results, we are 
> getting a 30% performance boost over Hive on smaller data sets (1-10 GB), 
> but Hive starts to perform better than Tez as data size increases. For 
> example, when we run a Hive query with Tez on about 2.3 TB of data, it 
> performs about 20% worse than Hive alone. Details are in the post below.
> On a cluster with 1.3 TB RAM, I set the following properties:
> set tez.task.resource.memory.mb=10000; set tez.am.resource.memory.mb=59205; 
> set tez.am.launch.cmd-opts =-Xmx47364m; set hive.tez.container.size=59205; 
> set hive.tez.java.opts=-Xmx47364m; set tez.am.grouping.max-size=36700160000;
> Is this normal, or am I missing some property or configuring one 
> incorrectly? Also, I am using an older version of Tez for now. Could that be 
> the issue too? I still have to bootstrap the latest version of Tez on EMR and 
> test whether it does any better.
> I thought I would ask here too:
> http://www.jwplayer.com/blog/hive-with-tez-on-emr/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
