On Jun 20, 2014, at 11:27 AM, Lars Selsaas <lars.sels...@thinkbiganalytics.com> wrote:
I'm also wondering which settings I can play around with to affect this?
Say I want to make my jobs keep stuff longer.
Thanks,
Lars
On Fri, Jun 20, 2014 at 11:08 AM, Lars Selsaas <lars.sels...@thinkbiganalytics.com> wrote:
Thanks!
Hopefully I'm getting the correct logs here:
It seems the same application master keeps taking the requests.
Both runs get the same application ID: application_1403285786962_0002
dag_1403285786962_0004_1.dot : Total file length is 2179 bytes.
dag_1403285786962_0004_2.dot : Total file length is 2179 bytes.
dag_1403285786962_0004_3.dot : Total file length is 2179 bytes.
dag_1403285786962_0004_4.dot : Total file length is 2179 bytes.
stderr : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_1 : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_1_post : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_2 : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_2_post : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_3 : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_3_post : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_4 : Total file length is 0 bytes.
stderr_dag_1403285786962_0004_4_post : Total file length is 0 bytes.
stdout : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_1 : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_1_post : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_2 : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_2_post : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_3 : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_3_post : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_4 : Total file length is 0 bytes.
stdout_dag_1403285786962_0004_4_post : Total file length is 0 bytes.
syslog : Total file length is 7577 bytes.
syslog_dag_1403285786962_0004_1 : Total file length is 57034 bytes.
syslog_dag_1403285786962_0004_1_post : Total file length is 4775 bytes.
syslog_dag_1403285786962_0004_2 : Total file length is 56104 bytes.
syslog_dag_1403285786962_0004_2_post : Total file length is 707 bytes.
syslog_dag_1403285786962_0004_3 : Total file length is 53187 bytes.
syslog_dag_1403285786962_0004_3_post : Total file length is 5003 bytes.
syslog_dag_1403285786962_0004_4 : Total file length is 56111 bytes.
syslog_dag_1403285786962_0004_4_post : Total file length is 4204 bytes.
fast run
Map 1      1  734 Bytes  438 Bytes  639 ms
Map 2      1  245 KB     478 Bytes  1.34 secs
Reducer 3  1  446 Bytes  557 Bytes  3.63 secs
slow run
Map 1      1  734 Bytes  438 Bytes  12.62 secs
Map 2      1  245 KB     478 Bytes  14.37 secs
Reducer 3  1  446 Bytes  557 Bytes  15.67 secs
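The two runs above can be compared directly. The following is a minimal Python sketch that converts the reported times to seconds and subtracts them per vertex. The function names are illustrative, and the script assumes the last column is a time reported for each vertex.

```python
import re

# Times copied from the fast and slow runs listed above.
FAST = {"Map 1": "639 ms", "Map 2": "1.34 secs", "Reducer 3": "3.63 secs"}
SLOW = {"Map 1": "12.62 secs", "Map 2": "14.37 secs", "Reducer 3": "15.67 secs"}


def to_seconds(text):
    """Convert a duration such as '639 ms' or '1.34 secs' to seconds."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(ms|secs?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000.0 if unit == "ms" else value


def extra_seconds(fast, slow):
    """Return, per vertex, how many seconds the slow run took beyond the fast run."""
    return {
        name: round(to_seconds(slow[name]) - to_seconds(fast[name]), 2)
        for name in fast
    }


if __name__ == "__main__":
    for name, extra in extra_seconds(FAST, SLOW).items():
        print(f"{name}: +{extra} s")
```

The difference comes out at roughly 12 to 13 seconds for every vertex, while the input and output sizes are identical. That pattern would be consistent with a fixed delay before the tasks start, rather than slower processing, although the numbers alone do not prove it.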
On Fri, Jun 20, 2014 at 10:31 AM, Hitesh Shah <hit...@apache.org> wrote:
Hello Lars,
Just to be very clear: there is no caching of results or data across queries,
except for some minimal metadata caching for ORC. If you can send across the
logs generated by "yarn logs -applicationId <appId>", we can try to help you
get a better understanding of where the speed difference is stemming from.
-- Hitesh
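For anyone who wants to script the log collection, here is a small Python sketch around the command quoted above. The helper names and the ID check are assumptions made for illustration, and running `fetch_logs` requires the `yarn` CLI on the PATH of a machine that can reach the cluster.

```python
import re
import subprocess

# Shape of a YARN application ID as seen in this thread,
# e.g. application_1403285786962_0002.
APP_ID = re.compile(r"application_\d+_\d+")


def build_command(app_id):
    """Return the argv list for 'yarn logs -applicationId <appId>'."""
    if APP_ID.fullmatch(app_id) is None:
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return ["yarn", "logs", "-applicationId", app_id]


def fetch_logs(app_id):
    """Run the command and return its standard output as text."""
    result = subprocess.run(
        build_command(app_id), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Application ID quoted earlier in this thread.
    print(" ".join(build_command("application_1403285786962_0002")))
```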
On Jun 20, 2014, at 10:13 AM, Bikas Saha <bi...@hortonworks.com> wrote:
> Hi,
>
> Thanks for your interest in trying out Hive on Tez. There are multiple
> reasons for the observations you describe below.
> 1) Containers warm up the longer they are used. If you run queries
> repeatedly, the JVM has all classes loaded and ready and may have JIT-ed
> the frequently run code paths. As it learns more about your execution
> pattern, the JIT can do a better job. This helps across different queries.
> 2) As you frequently access the same data, the chances increase of finding
> that data in the OS buffer cache, so you get the benefits of in-memory
> data. This helps repeated runs of queries on the same data.
> 3) Hive is smart about explicitly caching deserialized data (Java objects)
> within a query in order to avoid recomputing work that has already been
> done. This helps within a query.
> 4) If you are using the ORC file format, Hive will try to cache ORC file
> metadata such as locations and sizes, which helps different queries that
> access the same data.
> 5) If your Tez query session has been idle for some time, the system starts
> proactively releasing resources back to the cluster so that other
> applications can use them (good for multi-tenancy). If you fire a query
> after some delay, you will see a slowdown whenever some of the released
> resources have to be reclaimed. The idle period before resources are
> released is configurable.
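As a rough illustration of the configurable idle release in point 5, the Python sketch below renders a few session settings as Hive `set` statements. The property names are taken from later Apache Tez releases and are an assumption here: they may be named differently or be missing in the version discussed in this thread, and the values are made-up examples rather than recommendations. Settings that affect the Tez application master generally need to be in place before the session is started.

```python
# Assumed property names (from later Apache Tez releases) with hypothetical
# example values; verify both against the Tez version you are running.
SESSION_SETTINGS = {
    # Seconds an idle session AM waits for the next DAG before shutting down.
    "tez.session.am.dag.submit.timeout.secs": 600,
    # Bounds (milliseconds) on how long an idle container is held before release.
    "tez.am.container.idle.release-timeout-min.millis": 30000,
    "tez.am.container.idle.release-timeout-max.millis": 60000,
}


def hive_set_statements(settings):
    """Render properties as Hive 'set key=value;' lines, in insertion order."""
    return [f"set {key}={value};" for key, value in settings.items()]


if __name__ == "__main__":
    print("\n".join(hive_set_statements(SESSION_SETTINGS)))
```

Holding containers longer trades cluster capacity for lower query latency, so larger values are friendlier to a single user than to a shared cluster.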
>
> Hope this helps and you have a positive experience experimenting with Hive
> on Tez.
> Please let us know how we can help!
> Bikas
>
> From: Lars Selsaas [mailto:lars.sels...@thinkbiganalytics.com]
> Sent: Friday, June 20, 2014 8:50 AM
> To: user
> Subject: Tez performance on Hive
>
> Hi,
>
> So when you set Tez as the execution engine for Hive, a query takes about
> half the time to finish the second time you run it, going from, say, 24
> seconds to 12 seconds. But if I keep rerunning it, it gets down to about 2
> seconds for that same query. The time goes back up to 12 seconds if I wait
> too long before the next rerun or if I make large enough changes to the
> query.
>
> I'm working on a blog post about Tez and need to find out why this is
> happening. The first reduction seems to be mainly because of hot containers
> that store the information about where to find your data, while the further
> reduction to about 2 seconds seems to come from some in-memory storage of
> the data. Does it store the results in memory and keep them ready for next
> time?
>
>
>
> --
> Lars Selsaas
> Data Engineer
> Think Big Analytics
> lars.sels...@thinkbiganalytics.com
> 650-537-5321
>
>