[
https://issues.apache.org/jira/browse/TEZ-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddharth Seth updated TEZ-2366:
--------------------------------
Attachment: TEZ-2366.test.txt
This is what I believe causes this.
Pig is running the MiniCluster with multiple instances of the NodeManager,
which is good. As soon as there are multiple instances, however, the hostname
on all of them matches, so each task attempts a local fetch even though the
data may have been generated on a different NodeManager (implying a different
local-dir).
Tez never runs into this, because all tests run with a single NodeManager
instance.
There are two possible fixes. One is to disable local fetch for the
MiniCluster, for which I'm uploading a temporary patch.
The other is to potentially make use of port information to figure out whether
to do a local fetch or not.
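A minimal sketch of the port-based option, to make the idea concrete. This is
not the Tez Fetcher code: the class LocalFetchDecision, the Endpoint type and
the method shouldUseLocalFetch are hypothetical names, and it assumes that each
NodeManager in the MiniCluster exposes a distinct shuffle port.

```java
import java.util.Objects;

/** Hypothetical sketch: decide on a local fetch using host AND shuffle port. */
final class LocalFetchDecision {

    /** A shuffle endpoint: NodeManager host plus its (assumed distinct) shuffle port. */
    static final class Endpoint {
        final String host;
        final int port;

        Endpoint(String host, int port) {
            this.host = Objects.requireNonNull(host, "host");
            this.port = port;
        }
    }

    /**
     * Returns true only when local fetch is enabled and the source endpoint
     * matches this task's own endpoint on both host and port. Comparing the
     * host alone is what goes wrong when several NodeManagers share a hostname.
     */
    static boolean shouldUseLocalFetch(boolean localFetchEnabled,
                                       Endpoint self,
                                       Endpoint source) {
        return localFetchEnabled
                && self.host.equals(source.host)
                && self.port == source.port;
    }

    public static void main(String[] args) {
        // Port numbers below are illustrative values only.
        Endpoint self = new Endpoint("localhost", 13562);
        Endpoint sameNodeManager = new Endpoint("localhost", 13562);
        Endpoint otherNodeManager = new Endpoint("localhost", 13563);

        System.out.println(shouldUseLocalFetch(true, self, sameNodeManager));
        System.out.println(shouldUseLocalFetch(true, self, otherNodeManager));
    }
}
```

With a check like this, a task would fall back to the regular shuffle fetch
whenever the data came from a NodeManager on the same host but a different
port.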
[~daijy] - please try out this patch.
Alternatively, disable local fetch in the Pig config after setting up the
cluster and calling getConfig on the cluster instance.
> Pig tez MiniTezCluster unit tests fail intermittently after TEZ-2333
> --------------------------------------------------------------------
>
> Key: TEZ-2366
> URL: https://issues.apache.org/jira/browse/TEZ-2366
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Daniel Dai
> Priority: Critical
> Attachments: TEZ-2366.test.txt
>
>
> Around 20 unit tests (out of around 2000) fail intermittently after
> TEZ-2333. Here is a stack trace:
> {code}
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find output/attempt_1429899954360_0001_1_01_000000_1_10003/file.out.index in any of the configured local directories
> at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:449)
> at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:164)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.getShuffleInputFileName(Fetcher.java:611)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.getTezIndexRecord(Fetcher.java:591)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doLocalDiskFetch(Fetcher.java:536)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupLocalDiskFetch(Fetcher.java:517)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:190)
> at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:72)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> To reproduce this in the Pig tests, use the following commands:
> svn co http://svn.apache.org/repos/asf/pig/trunk
> ant -Dhadoopversion=23 -Dtest.exec.type=tez -Dtestcase=TestTezAutoParallelism test
> Note that the Pig codebase already sets TEZ_RUNTIME_OPTIMIZE_LOCAL_FETCH to
> "true"
> (http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java?view=markup).
> I tried changing TEZ_RUNTIME_OPTIMIZE_LOCAL_FETCH to "false" in Pig, but it
> did not help.