Rajesh Balamohan created TEZ-2719:
-------------------------------------
Summary: Reduce logging in unordered fetcher with shared-fetch option
Key: TEZ-2719
URL: https://issues.apache.org/jira/browse/TEZ-2719
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
With a large broadcast, the amount of logging done by the unordered fetcher
with the shared-fetch option can become a problem.
For example:
In one of the jobs (query_17 @ 10 TB scale), Map 7 generates around 1.1 GB of
data, which is given to 330 tasks in downstream Map 1.
Map 1 uses all slots in the cluster (~224 per wave). Until the data is
downloaded, shared fetch ends up re-queuing fetches. As part of that, it ends up
printing 3 logs per attempt. E.g.:
{noformat}
2015-08-14 02:09:11,761 INFO [Fetcher [Map_7] #0] shuffle.Fetcher: Requeuing
machine1:13562 downloads because we didn't get a lock
2015-08-14 02:09:11,761 INFO [Fetcher [Map_7] #0] shuffle.Fetcher: Shared fetch
failed to return 1 inputs on this try
2015-08-14 02:09:11,761 INFO [ShuffleRunner [Map_7]] impl.ShuffleManager:
Scheduling fetch for inputHost: machine1:13562
2015-08-14 02:09:11,761 INFO [ShuffleRunner [Map_7]] impl.ShuffleManager:
Created Fetcher for host: machine1 with inputs: [InputAttemptIdentifier
[inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0,
pathComponent=attempt_1439264591968_0058_1_04_000000_0_10029,
fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1]]
{noformat}
Depending on disk and network speed, it might take some time for the fetcher to
finish downloading and release the lock. Since there was only one task in Map-1,
it ended up in a sort of tight loop, generating relatively large logs.
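
One possible way to keep such a tight re-queue loop from flooding the log is to
rate-limit the repeated message. The following is only a minimal sketch under
that assumption, not the change made for this issue: LogThrottle, shouldLog and
drainSuppressed are hypothetical names that do not exist in Tez.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Tez): lets at most one message through
 * per interval and counts the messages that were suppressed in between.
 */
final class LogThrottle {
    private final long intervalNanos;
    // Timestamp of the last permitted message; Long.MIN_VALUE means "never logged".
    private final AtomicLong lastLogged = new AtomicLong(Long.MIN_VALUE);
    private final AtomicLong suppressed = new AtomicLong();

    LogThrottle(long intervalMillis) {
        this.intervalNanos = intervalMillis * 1_000_000L;
    }

    /** Returns true if the caller may log now; otherwise counts a suppressed message. */
    boolean shouldLog(long nowNanos) {
        long last = lastLogged.get();
        boolean due = last == Long.MIN_VALUE || nowNanos - last >= intervalNanos;
        // compareAndSet ensures only one of several racing threads wins the slot.
        if (due && lastLogged.compareAndSet(last, nowNanos)) {
            return true;
        }
        suppressed.incrementAndGet();
        return false;
    }

    /** Returns the number of suppressed messages and resets the counter. */
    long drainSuppressed() {
        return suppressed.getAndSet(0);
    }
}
```

A caller in the re-queue path could guard its log statement with
shouldLog(System.nanoTime()) and include drainSuppressed() in the message, so
the number of skipped re-queue attempts is still visible.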
It looks like 260-290 MB of task log files are created per attempt in this case.
That would be around 2.3 GB to 3 GB (depending on the number of slots waiting)
on a machine with 8-10 slots.
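
The per-machine figure follows from multiplying the per-attempt log size by the
number of waiting slots. A rough sketch of that arithmetic, assuming
1 GB = 1024 MB (LogVolumeEstimate and totalGb are illustrative names only):

```java
/** Illustrative back-of-the-envelope estimate; names are hypothetical. */
final class LogVolumeEstimate {
    /** Total log volume in GB for `slots` attempts writing `mbPerAttempt` MB each. */
    static double totalGb(int slots, double mbPerAttempt) {
        return slots * mbPerAttempt / 1024.0; // assumption: 1 GB = 1024 MB
    }
}
```

With these assumptions, 8 slots at 260 MB give about 2.0 GB and 10 slots at
290 MB give about 2.8 GB, which is in the same range as the numbers quoted
above.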