[
https://issues.apache.org/jira/browse/TEZ-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340796#comment-14340796
]
Bikas Saha commented on TEZ-2148:
---------------------------------
[~oae] Do you have access to the tez-site.xml? Since this happens with busy
clusters I am guessing that this is a result of contention between multiple
sessions such that newer sessions starve when older sessions have not yet
released idle resources.
Do you observe that the first Tez flow is comparable/faster than the first MR
flow and the subsequent Tez flows start slowing down compared to MR? Do
multiple sets of 4 DAGs go to the same Tez session or do you close the session
after the 4 DAGs have completed and start a new session for the next query?
What are the values of tez.am.container.idle.release-timeout-min.millis,
tez.am.container.idle.release-timeout-max.millis, and
tez.am.session.min.held-containers? Their default values are 5s, 10s, and 0,
which means that idle resources will be released between 5 and 10 seconds and
all idle containers will be released.
One way to test this theory would be to set the min/max timeouts to low values
like 100 ms or 250 ms and try it out. This should release containers quickly to
other sessions.
If you are submitting multiple queries to a pool of running sessions (where
each query can be multiple DAGs), then what you could do is set low values for
the min/max timeouts such that they serve your need for within-DAG reuse, and
set a value of tez.am.session.min.held-containers such that your sessions hold
enough containers to get good initial latency while new containers are being
acquired to complete the remainder of the DAG. Tez tries to spread min-held
containers evenly across the machines in the cluster for good initial locality,
so having 1 per machine (if possible) would be good.
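To make this concrete, the suggested tuning could look roughly like the following tez-site.xml fragment. The values shown are illustrative examples of the low-timeout/held-containers combination described above, not recommendations; pick them for your own cluster.

```xml
<!-- Illustrative tez-site.xml fragment; values are examples only. -->
<property>
  <!-- Release idle containers quickly so other sessions are not starved. -->
  <name>tez.am.container.idle.release-timeout-min.millis</name>
  <value>250</value>
</property>
<property>
  <name>tez.am.container.idle.release-timeout-max.millis</name>
  <value>500</value>
</property>
<property>
  <!-- Keep some containers held per session for good initial latency;
       ideally about one per machine in the cluster. -->
  <name>tez.am.session.min.held-containers</name>
  <value>10</value>
</property>
```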
> Slow container grabbing with Capacity Scheduler in comparison to MapReduce
> --------------------------------------------------------------------------
>
> Key: TEZ-2148
> URL: https://issues.apache.org/jira/browse/TEZ-2148
> Project: Apache Tez
> Issue Type: Task
> Affects Versions: 0.5.1
> Reporter: Johannes Zillmann
> Attachments: TEZ-2148.svg, applicationLogs.zip,
> capacity-scheduler.xml, client-mapreduce.log, client-tez.log, dag1.pdf,
> dag2.pdf, dag3.pdf, dag4.pdf
>
>
> A customer experienced the following:
> - Set up a CapacityScheduler for user 'company'
> - The same processing job on the same data is faster with MapReduce than with
> Tez under "normal" cluster load. Only if nothing else runs on Hadoop does Tez
> outperform MapReduce. (It's hard to give exact data here since we get all
> information second-hand from the customer, but the timings were pretty stable
> over a dozen runs. The MapReduce job finishes in about 70 sec and Tez in about
> 170 sec.)
> So the question is: is there some difference in how Tez grabs resources
> from the capacity scheduler compared to MapReduce?
> Looking at the logs it looks like Tez is always very slow in starting the
> containers, whereas MapReduce parallelizes very quickly.
> Attached client and application logs for Tez and MapReduce run as well as the
> scheduler configuration.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)