[
https://issues.apache.org/jira/browse/TEZ-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-2378:
----------------------------------
Attachment: TEZ-2378.1.patch
- For broadcast inputs, the fetcher tries to download the data and cache it
locally when shared-fetch is enabled. Subsequent tasks scheduled on the same
node can then read from local disk instead of downloading from the remote
machine. For example, when an InputHost has 4 srcAttempts, it is quite possible
that a couple of them have already been downloaded to local disk while the rest
have not. The ones not yet downloaded are scheduled via doHttpFetch. Just before
this step, the fetcher tries to optimize by reading whatever is available on
local disk, and it logs a disk exception for every attempt that is not
available. It then falls back to HTTP fetch for the data that is not available
locally. In large jobs this inflates the logs and distracts from debugging.
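
A minimal sketch of the proposed behaviour, under stated assumptions: this is
not the Tez Fetcher code or the attached patch. The class LocalFetchSketch, its
methods, and the miss counter are hypothetical names, and java.util.logging
(where Level.FINE plays the role of debug) is used only to keep the example
self-contained.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; none of these names exist in Tez.
class LocalFetchSketch {
    private static final Logger LOG =
            Logger.getLogger(LocalFetchSketch.class.getName());

    // Counts local-fetch misses so they stay visible without WARN logging.
    static final AtomicLong localFetchMisses = new AtomicLong();

    // Stand-in for reading a cached output from local disk (assumption).
    static void readFromLocalDisk(String attempt, Set<String> cachedLocally)
            throws FileNotFoundException {
        if (!cachedLocally.contains(attempt)) {
            throw new FileNotFoundException(
                    "Could not find output/" + attempt + "/file.out.index");
        }
    }

    // Tries local disk for every attempt and returns the attempts that
    // still have to be fetched over HTTP.
    static List<String> fetchLocalThenRemaining(List<String> srcAttempts,
                                                Set<String> cachedLocally) {
        List<String> remaining = new ArrayList<>();
        for (String attempt : srcAttempts) {
            try {
                readFromLocalDisk(attempt, cachedLocally);
            } catch (FileNotFoundException e) {
                // An expected miss: count it and log at debug level only.
                localFetchMisses.incrementAndGet();
                if (LOG.isLoggable(Level.FINE)) {
                    LOG.log(Level.FINE, "Local fetch miss for " + attempt
                            + "; falling back to HTTP fetch", e);
                }
                remaining.add(attempt);
            }
        }
        return remaining;
    }
}
```

With four attempts of which two are cached, the call returns the two uncached
attempts for the HTTP path. With the default java.util.logging configuration
(INFO), nothing is written to the log, and the counter still records the misses.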
> In case Fetcher (unordered) fails to do local fetch, log in debug mode to
> reduce log size
> -----------------------------------------------------------------------------------------
>
> Key: TEZ-2378
> URL: https://issues.apache.org/jira/browse/TEZ-2378
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Attachments: TEZ-2378.1.patch
>
>
> The following can be logged at DEBUG level instead of WARN. Counters could
> be added later to track the number of times local fetch fails.
> {noformat}
> 2015-04-28 05:41:45,487 WARN [Fetcher [Map_5] #15] shuffle.Fetcher: Failed to
> shuffle output of InputAttemptIdentifier [inputIdentifier=InputIdentifier
> [inputIndex=81], attemptNumber=0,
> pathComponent=attempt_1429683757595_0485_1_03_000081_0_10003,
> fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1] from
> cn047-10.l42scl.hortonworks.com(local fetch)
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> output/attempt_1429683757595_0485_1_03_000081_0_10003/file.out.index in any
> of the configured local directories
> at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:449)
> at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:164)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.getShuffleInputFileName(Fetcher.java:612)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.getTezIndexRecord(Fetcher.java:592)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doLocalDiskFetch(Fetcher.java:537)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doSharedFetch(Fetcher.java:353)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:192)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:72)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)