[
https://issues.apache.org/jira/browse/TEZ-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083231#comment-16083231
]
Rajesh Balamohan commented on TEZ-3754:
---------------------------------------
This is not an infinite loop similar to the one which got fixed in TEZ-1923.
IMO, this issue is not due to a regression.
Observed this scenario with a large ETL job. Fetchers get a mix of small and
large datasets to download. Commit memory gets close to the memory threshold.
In such a case, when large data (< single shuffle memory limit) needs to be
downloaded, the fetcher ends up in the WAIT state. Meanwhile, other fetchers
downloading smaller data release memory quickly, and the larger dataset gets
scheduled for download again. However, the freed memory is still not enough,
and the fetcher ends up in the WAIT state again. In corner cases, this happens
frequently within a short span of time.
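To make the retry pattern concrete, here is a rough sketch of it. This is not the Tez MergeManager code: the class, the method names and the "grant only if the request fits under the limit" rule are all assumptions made for illustration. It counts how many times a large request is told to WAIT when it retries after every small release.

```java
/** Hypothetical, simplified model of a shuffle memory budget (not Tez code). */
class ShuffleMemoryModel {
    enum Status { OK, WAIT }

    private final long memoryLimit;
    private long usedMemory;

    ShuffleMemoryModel(long memoryLimit) {
        this.memoryLimit = memoryLimit;
    }

    /** Assumed rule: grant the request only if it fits under the limit. */
    synchronized Status reserve(long requestedSize) {
        if (usedMemory + requestedSize > memoryLimit) {
            return Status.WAIT;
        }
        usedMemory += requestedSize;
        return Status.OK;
    }

    /** Called when a fetcher hands its memory back. */
    synchronized void release(long size) {
        usedMemory -= size;
    }

    /**
     * Counts the WAIT responses a large request receives when it is
     * retried immediately after each small release.
     */
    static int countWastedRetries(long limit, long alreadyUsed,
                                  long largeRequest, long smallRelease) {
        ShuffleMemoryModel model = new ShuffleMemoryModel(limit);
        model.reserve(alreadyUsed);
        int waits = 0;
        while (model.reserve(largeRequest) == Status.WAIT) {
            waits++;                      // one wasted connection attempt
            model.release(smallRelease);  // a small fetch finishes
        }
        return waits;
    }

    public static void main(String[] args) {
        // Limit 1000, 950 in use, large request 200, small fetches free 10 each.
        System.out.println("WAIT responses before success: "
                + countWastedRetries(1000, 950, 200, 10));
    }
}
```

With these made-up numbers the large request is rejected 15 times before it fits. In the real scenario each rejection comes after the connection to the source node has already been set up.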
When a larger number of machines participate in the same vertex, the chances
of the source node coming under pressure become much higher, as the data it
reads is wasted.
This need not be a blocker for 0.9.
> With large ordered fetches, stalling shuffle could lead to slowdown of job
> --------------------------------------------------------------------------
>
> Key: TEZ-3754
> URL: https://issues.apache.org/jira/browse/TEZ-3754
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-3754.1.patch
>
>
> When large ordered fetches are being downloaded, depending on the memory
> availability in the merger, it is possible to stall the shuffle ({{Type::WAIT}}).
> This happens after establishing the connection, validation at the server side, etc.
> However, depending on memory availability, the allocator could return
> {{Type::WAIT}} to retry fetching later.
> In corner cases, retries happen very frequently and this creates pressure on
> the server side. The server side populates the headers and starts sending the map
> output. This effort is wasted on the client side as the data is not downloaded.
> When multiple nodes have a similar issue, subsequent connections time out (180
> seconds), which turns out to be very expensive.
> {noformat}
> 2017-06-06 04:27:49,574 [INFO] [fetcher {Map_1} #44]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#44 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,575 [INFO] [fetcher {Map_1} #31]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,575 [INFO] [fetcher {Map_1} #31]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#31 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,576 [INFO] [fetcher {Map_1} #43]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,576 [INFO] [fetcher {Map_1} #43]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#43 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,578 [INFO] [fetcher {Map_1} #32]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,578 [INFO] [fetcher {Map_1} #32]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#32 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,579 [INFO] [fetcher {Map_1} #42]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,579 [INFO] [fetcher {Map_1} #42]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#42 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,580 [INFO] [fetcher {Map_1} #33]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,580 [INFO] [fetcher {Map_1} #33]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#33 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,581 [INFO] [fetcher {Map_1} #41]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,581 [INFO] [fetcher {Map_1} #41]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#41 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,583 [INFO] [fetcher {Map_1} #34]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,583 [INFO] [fetcher {Map_1} #34]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#34 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,584 [INFO] [fetcher {Map_1} #40]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,584 [INFO] [fetcher {Map_1} #40]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#40 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,586 [INFO] [fetcher {Map_1} #35]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,586 [INFO] [fetcher {Map_1} #35]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#35 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,587 [INFO] [fetcher {Map_1} #39]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,587 [INFO] [fetcher {Map_1} #39]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#39 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,588 [INFO] [fetcher {Map_1} #36]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,588 [INFO] [fetcher {Map_1} #36]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,589 [INFO] [fetcher {Map_1} #38]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000841_1_10008,attempt_1496458948260_0050_1_00_000843_0_10007
> sent hash and receievd reply 0 ms
> 2017-06-06 04:27:49,589 [INFO] [fetcher {Map_1} #38]
> |orderedgrouped.FetcherOrderedGrouped|: fetcher#38 - MergerManager returned
> Status.WAIT ...
> 2017-06-06 04:27:49,590 [INFO] [fetcher {Map_1} #37]
> |shuffle.HttpConnection|: for
> url=http://ctr-e133-1493418528701-76897-01-000005:13562/mapOutput?job=job_1496458948260_0050&reduce=1931&map=attempt_1496458948260_0050_1_00_000843_0_10007,attempt_1496458948260_0050_1_00_000841_1_10008
> sent hash and receievd reply 0 ms
> {noformat}
> In such cases, it would be good to wait until {{usedMemory}} gets released
> during the "next" iteration, as opposed to retrying immediately.
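One way the "wait until memory is released" idea could look is sketched below. It is only an illustration and not the attached patch: the class and method names are hypothetical, and the admission rule (request must fit under the limit) is an assumption. The fetcher thread blocks on a monitor and is woken only when another fetcher releases memory, so no connection is opened while the request cannot be satisfied.

```java
/** Hypothetical sketch: block until memory is released instead of retrying. */
class BlockingShuffleBudget {
    private final long memoryLimit;
    private long usedMemory;

    BlockingShuffleBudget(long memoryLimit) {
        this.memoryLimit = memoryLimit;
    }

    /** Non-blocking attempt; returns false where a WAIT would be returned. */
    synchronized boolean tryReserve(long size) {
        if (usedMemory + size > memoryLimit) {
            return false;
        }
        usedMemory += size;
        return true;
    }

    /** Blocks the caller until the request fits under the limit. */
    synchronized void reserveBlocking(long size) {
        if (size > memoryLimit) {
            throw new IllegalArgumentException("request can never fit");
        }
        try {
            while (usedMemory + size > memoryLimit) {
                wait(); // woken by release(); condition is re-checked
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        usedMemory += size;
    }

    /** Returns memory to the budget and wakes any blocked fetchers. */
    synchronized void release(long size) {
        usedMemory -= size;
        notifyAll();
    }

    synchronized long usedMemory() {
        return usedMemory;
    }

    public static void main(String[] args) {
        BlockingShuffleBudget budget = new BlockingShuffleBudget(1000);
        budget.tryReserve(950);
        // Small fetchers finish in the background, freeing 10 units each.
        new Thread(() -> {
            for (int i = 0; i < 15; i++) {
                budget.release(10);
            }
        }).start();
        budget.reserveBlocking(200); // returns once enough memory is free
        System.out.println("used after large reserve: " + budget.usedMemory());
    }
}
```

The condition is re-checked in a loop after every wake-up, so a release that frees too little memory simply puts the fetcher back to sleep rather than triggering another fetch attempt.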