[ 
https://issues.apache.org/jira/browse/TEZ-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2378:
----------------------------------
    Attachment: TEZ-2378.1.patch

- For broadcast, the fetcher tries to download the data and cache it locally
when shared-fetch is enabled. Subsequent tasks scheduled on the same node can
then read from local disk instead of downloading from a remote machine. For
example, when an InputHost has 4 srcAttempts, it is quite possible that a
couple of them have already been downloaded to local disk and the rest have
not. Those not yet downloaded are scheduled via doHttpFetch. Just before that
step, the fetcher tries to optimize by reading whatever is available on local
disk, and it logs a disk exception for every attempt that is not available
there. It then falls back to HTTP fetch for the data that is not available
locally. In large jobs this inflates the logs and distracts from debugging.
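
The flow described above can be sketched roughly as follows. This is only an
illustration, not the code in TEZ-2378.1.patch: the class LocalFetchSketch, its
methods, and the attempt names are hypothetical, and java.util.logging (with
FINE standing in for debug level) is used only to keep the example
self-contained.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; this is not the Tez Fetcher implementation.
final class LocalFetchSketch {
  private static final Logger LOG =
      Logger.getLogger(LocalFetchSketch.class.getName());

  // Stand-in for a local disk read: fails when the attempt is not cached yet.
  static void readFromLocalDisk(String srcAttempt, Set<String> cachedOnDisk)
      throws IOException {
    if (!cachedOnDisk.contains(srcAttempt)) {
      throw new FileNotFoundException(
          "Could not find output/" + srcAttempt + "/file.out.index");
    }
  }

  // Tries every attempt locally and returns the ones that still need HTTP.
  static List<String> fetchLocalThenRemaining(List<String> srcAttempts,
                                              Set<String> cachedOnDisk) {
    List<String> remaining = new ArrayList<>();
    for (String srcAttempt : srcAttempts) {
      try {
        readFromLocalDisk(srcAttempt, cachedOnDisk);
      } catch (IOException e) {
        // A local miss is expected here, so it is logged at debug level
        // (FINE) rather than WARN; the exception is only attached when
        // debug logging is enabled.
        if (LOG.isLoggable(Level.FINE)) {
          LOG.log(Level.FINE, "Local fetch failed for " + srcAttempt
              + "; falling back to HTTP fetch", e);
        }
        remaining.add(srcAttempt);
      }
    }
    return remaining;
  }

  public static void main(String[] args) {
    List<String> srcAttempts =
        Arrays.asList("attempt_0", "attempt_1", "attempt_2", "attempt_3");
    Set<String> cachedOnDisk =
        new HashSet<>(Arrays.asList("attempt_0", "attempt_2"));
    // The two attempts missing from local disk are left for the HTTP fetch.
    System.out.println(fetchLocalThenRemaining(srcAttempts, cachedOnDisk));
  }
}
```

With the default java.util.logging configuration (INFO), the misses produce no
log output at all, which is the effect the issue asks for.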



> In case Fetcher (unordered) fails to do local fetch, log in debug mode to 
> reduce log size
> -----------------------------------------------------------------------------------------
>
>                 Key: TEZ-2378
>                 URL: https://issues.apache.org/jira/browse/TEZ-2378
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: TEZ-2378.1.patch
>
>
> The following can be logged at DEBUG level instead of WARN level. Counters 
> could be added later to track the number of times local fetch failed.
> {noformat}
> 2015-04-28 05:41:45,487 WARN [Fetcher [Map_5] #15] shuffle.Fetcher: Failed to shuffle output of InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=81], attemptNumber=0, pathComponent=attempt_1429683757595_0485_1_03_000081_0_10003, fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1] from cn047-10.l42scl.hortonworks.com(local fetch)
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find output/attempt_1429683757595_0485_1_03_000081_0_10003/file.out.index in any of the configured local directories
>         at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:449)
>         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:164)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.getShuffleInputFileName(Fetcher.java:612)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.getTezIndexRecord(Fetcher.java:592)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.doLocalDiskFetch(Fetcher.java:537)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.doSharedFetch(Fetcher.java:353)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:192)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:72)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
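
The counter idea mentioned in the description could look something like the
sketch below. It is an assumption-laden illustration only: the class
LocalFetchMissCounter is hypothetical, and a plain AtomicLong stands in for
whatever counter facility an actual change in Tez would use.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: counts local-fetch misses instead of logging each one.
final class LocalFetchMissCounter {
  // AtomicLong because several fetcher threads may record misses concurrently.
  private final AtomicLong misses = new AtomicLong();

  // Called once for every attempt that could not be read from local disk.
  void recordMiss() {
    misses.incrementAndGet();
  }

  // Total number of misses recorded so far.
  long getMisses() {
    return misses.get();
  }

  public static void main(String[] args) {
    LocalFetchMissCounter counter = new LocalFetchMissCounter();
    counter.recordMiss();
    counter.recordMiss();
    // A single summary line replaces one WARN stack trace per miss.
    System.out.println("local fetch misses: " + counter.getMisses());
  }
}
```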



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)