On 6/20/14, 11:42 AM, Hitesh Shah wrote:
The main config that controls how long containers are kept is
"tez.am.container.session.delay-allocation-millis". Setting this to a higher
value will tell the AM to retain containers for a longer period. Increasing it
will, however, have a negative effect on other users of the cluster, as idle
resources will be retained by the Tez application.

There are two primary settings for keeping Tez running in case you want to let a single user monopolize a cluster (with Pig, Hive - pick your layer above Tez).

tez.session.am.dag.submit.timeout.secs=300

tez.am.container.session.delay-allocation-millis=10000

The first setting is how long Tez stays up as a YARN application when it is idle - the default is 300s. A bit of time is wasted whenever you lose the session, because Hive 0.13 ties its exec JAR to a session, and re-uploading it would trigger a GB or so of JAR copy traffic on a large(ish) cluster.

The second setting controls how long a container idles before it gets killed - setting this to a high value will fill up your current queue with idle containers. Other queues can preempt those containers and reclaim the space - but most people run multiple users/queries in one queue while testing, so this is usually turned down.

You can set both to 10 minutes to get predictable results there.
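For example, a minimal tez-site.xml sketch of that 10-minute setup (note the units differ: the first property is in seconds, the second in milliseconds):

  <property>
    <name>tez.session.am.dag.submit.timeout.secs</name>
    <value>600</value> <!-- 10 minutes of idle session before the AM exits -->
  </property>
  <property>
    <name>tez.am.container.session.delay-allocation-millis</name>
    <value>600000</value> <!-- hold idle containers for 10 minutes -->
  </property>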

There are a few "because we can" settings in there which work for small 1-3 node clusters (where all allocations are host-local, because of 3-replica HDFS). At 1.7 MB of data and 2 s runtimes, I'm assuming that's your setup.

tez.am.am-rm.heartbeat.interval-ms.max=10
tez.task.get-task.sleep.interval-ms.max=10
tez.runtime.broadcast.data-via-events.enabled=true
tez.runtime.broadcast.data-via-events.max-size=4096

FYI, if you are using Hive, you need to provide these items either in tez-site.xml or as -hiveconf options at launch. By the time you get to a hive> prompt, it is too late to set these up.
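For example, a sketch of passing them on the command line (using the illustrative 10-minute values from above):

  hive -hiveconf tez.session.am.dag.submit.timeout.secs=600 \
       -hiveconf tez.am.container.session.delay-allocation-millis=600000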

Cheers,
Gopal

On Jun 20, 2014, at 11:27 AM, Lars Selsaas <lars.sels...@thinkbiganalytics.com> wrote:

I'm also wondering which settings I can play around with to affect this? Say I want to make my jobs keep stuff longer.

Thanks,
Lars


On Fri, Jun 20, 2014 at 11:08 AM, Lars Selsaas <lars.sels...@thinkbiganalytics.com> wrote:
Thanks!

Hopefully I'm getting the correct logs here:

It seems the same ApplicationMaster keeps on taking the requests.

They both get the same application ID: application_1403285786962_0002
dag_1403285786962_0004_1.dot : Total file length is 2179 bytes.
dag_1403285786962_0004_2.dot : Total file length is 2179 bytes.
dag_1403285786962_0004_3.dot : Total file length is 2179 bytes.
dag_1403285786962_0004_4.dot : Total file length is 2179 bytes.
stderr : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_1 : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_1_post : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_2 : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_2_post : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_3 : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_3_post : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_4 : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_4_post : Total file length is 0 bytes.
stdout : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_1 : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_1_post : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_2 : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_2_post : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_3 : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_3_post : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_4 : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_4_post : Total file length is 0 bytes.
syslog : Total file length is 7577 bytes.
syslog_dag_1403285786962_0004_1 : Total file length is 57034 bytes.
syslog_dag_1403285786962_0004_1_post : Total file length is 4775 bytes.
syslog_dag_1403285786962_0004_2 : Total file length is 56104 bytes.
syslog_dag_1403285786962_0004_2_post : Total file length is 707 bytes.
syslog_dag_1403285786962_0004_3 : Total file length is 53187 bytes.
syslog_dag_1403285786962_0004_3_post : Total file length is 5003 bytes.
syslog_dag_1403285786962_0004_4 : Total file length is 56111 bytes.
syslog_dag_1403285786962_0004_4_post : Total file length is 4204 bytes.

fast run

Vertex      Tasks  Input      Output     Duration
Map 1       1      734 Bytes  438 Bytes  639 ms
Map 2       1      245 KB     478 Bytes  1.34 secs
Reducer 3   1      446 Bytes  557 Bytes  3.63 secs


slow run

Vertex      Tasks  Input      Output     Duration
Map 1       1      734 Bytes  438 Bytes  12.62 secs
Map 2       1      245 KB     478 Bytes  14.37 secs
Reducer 3   1      446 Bytes  557 Bytes  15.67 secs



On Fri, Jun 20, 2014 at 10:31 AM, Hitesh Shah <hit...@apache.org> wrote:
Hello Lars,

Just to be very clear - there is no caching of results/data across queries, except for some minimal metadata caching for ORC. If you can send across the logs generated by "yarn logs -applicationId <appId>", we can try and help you get a better understanding of where the speed difference is stemming from.
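
For example, using the application ID from your run (the output file name is just illustrative, redirecting makes it easy to attach):

  yarn logs -applicationId application_1403285786962_0002 > tez-app.log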

— Hitesh

On Jun 20, 2014, at 10:13 AM, Bikas Saha <bi...@hortonworks.com> wrote:

> Hi,
>
> Thanks for your interest in trying out Hive on Tez. There are multiple reasons for the observations you see below.
> 1) Containers warm up the longer they get used. So if you repeatedly run queries, the JVM has all classes loaded and ready, and may have JIT-ed the frequently run code paths. As it learns more about your execution pattern, the JIT can do a better job. This will help you across different queries.
> 2) As you frequently access the same data, the chances increase of finding that data in the OS buffer cache. So you get the benefits of in-memory data :) This will help repeated runs of queries on the same data.
> 3) Hive is smart about explicitly caching de-serialized data (Java objects) within a query in order to reduce re-computation of work that has already been done. This will help within a query.
> 4) If you are using the ORC file format, then Hive will try to cache ORC file metadata like locations/sizes etc., and this helps different queries that access the same data.
> 5) If your Tez query session has been idle for some time, then the system starts pro-actively releasing resources back to the cluster so that they may be used by other applications (good for multi-tenancy). So if you fire a query after some delay, a slowdown will be observed in case we need to reclaim some of the released resources. This delay is configurable.
>
> Hope this helps and you have a positive experience experimenting with Hive on Tez.
> Please let us know how we can help!
> Bikas
>
> From: Lars Selsaas [mailto:lars.sels...@thinkbiganalytics.com]
> Sent: Friday, June 20, 2014 8:50 AM
> To: user
> Subject: Tez performance on Hive
>
> Hi,
>
> So when you set Tez as the execution engine for Hive, it takes about half the time to finish a query the second time you run it, going from say 24 seconds to 12 seconds. But if I keep re-running it, it gets down to about 2 seconds on that same query. The time goes back up to 12 seconds if I wait too long before the next rerun or if I make large enough adjustments to the query.
>
> So I'm working on a blog post about Tez and need to find out why this is happening. The first speedup seems to mainly be due to hot containers that store the information about where to find your data, while the second reduction down to about 2 seconds seems to be some in-memory storage of the data. Does it store the results in memory and keep them ready for next time?
>
>
>
> --
> Lars Selsaas
> Data Engineer
> Think Big Analytics
> lars.sels...@thinkbiganalytics.com
> 650-537-5321
>
>








