[ 
https://issues.apache.org/jira/browse/TEZ-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaveen Raajan updated TEZ-2338:
-------------------------------
    Description: 
I successfully built Tez-0.6.0 against Hadoop-2.5.2.
Then I configured Tez-0.6.0 as described at http://tez.apache.org/install.html

I moved the Tez lib package to an HDFS location and updated my tez-site.xml:
{code:xml}
<property>
  <name>tez.lib.uris</name>
  <value>${fs.default.name}/apps/Tez/,${fs.default.name}/apps/Tez/lib/</value>
</property>
{code}
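
The `${fs.default.name}` references in the value above are expanded against other configuration properties before use; the client log further down reports the resolved value. A minimal sketch of that expansion follows, as a simplified stand-in rather than Hadoop's actual implementation. The `expand` helper and the plain-dict configuration are hypothetical; the `hdfs://HA-Cluster` value is taken from the client log in this report.

```python
import re

# Matches ${property.name} references inside a configuration value.
VAR_RE = re.compile(r"\$\{([^}]+)\}")


def expand(value, conf, max_depth=20):
    """Expand ${name} references using other entries of conf (a plain dict).

    Unknown names are left untouched. Expansion repeats until the value
    stops changing or max_depth passes have run.
    """
    for _ in range(max_depth):
        expanded = VAR_RE.sub(lambda m: conf.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value


# Hypothetical in-memory view of the two relevant properties.
conf = {
    "fs.default.name": "hdfs://HA-Cluster",
    "tez.lib.uris": "${fs.default.name}/apps/Tez/,${fs.default.name}/apps/Tez/lib/",
}

resolved = expand(conf["tez.lib.uris"], conf)
for uri in resolved.split(","):
    print(uri)
```

The two printed URIs should match the `tez.lib.uris` value that `TezClientUtils` logs during submission.
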
After that I tried the sample test for Tez:

_hadoop jar tez-examples-0.6.0.jar orderedwordcount <input> <output>_

But I get the following error while running this command.
*Note:* I'm using a Hadoop High Availability setup.

{code}
Running OrderedWordCount
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/04/15 10:47:57 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.6.0, revision=${buildNumber}, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-04-15T01:13:02Z ]
15/04/15 10:48:00 INFO client.TezClient: Submitting DAG application with id: application_1429073725727_0005
15/04/15 10:48:00 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
15/04/15 10:48:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://HA-Cluster/apps/Tez/,hdfs://HA-Cluster/apps/Tez/lib/
15/04/15 10:48:01 INFO client.TezClient: Stage directory /tmp/app/tez/staging doesn't exist and is created
15/04/15 10:48:01 INFO client.TezClient: Tez system stage directory hdfs://HA-cluster/tmp/app/tez/staging/.tez/application_1429073725727_0005 doesn't exist and is created
15/04/15 10:48:02 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1429073725727_0005, dagName=OrderedWordCount
15/04/15 10:48:03 INFO impl.YarnClientImpl: Submitted application application_1429073725727_0005
15/04/15 10:48:03 INFO client.TezClient: The url to track the Tez AM: http://syncserver34:8088/proxy/application_1429073725727_0005/
15/04/15 10:48:03 INFO client.DAGClientImpl: Waiting for DAG to start running
15/04/15 10:48:09 INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
OrderedWordCount failed with diagnostics: [Application application_1429073725727_0005 failed 2 times due to AM Container for appattempt_1429073725727_0005_000002 exited with  exitCode: -1073741515 due to: Exception from container-launch: ExitCodeException exitCode=-1073741515:
ExitCodeException exitCode=-1073741515:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

        1 file(s) moved.

Container exited with a non-zero exit code -1073741515
.Failing this attempt.. Failing the application.]
{code}
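
The exit code -1073741515 is easier to interpret as an unsigned 32-bit value: it is 0xC0000135, which Windows defines as the NTSTATUS code STATUS_DLL_NOT_FOUND. That would suggest the launched process could not load a DLL it depends on, although the log does not say which one. A minimal sketch of the conversion follows; the helper names and the small lookup table are illustrative only and cover just a few well-known codes.

```python
# A few well-known NTSTATUS values; this table is deliberately tiny.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}


def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code):
    """Return the hex form of the code plus its NTSTATUS name, if known."""
    value = to_unsigned32(exit_code)
    name = KNOWN_NTSTATUS.get(value, "unknown")
    return "0x{:08X} ({})".format(value, name)


print(describe_exit_code(-1073741515))
```
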

The ResourceManager log shows:
{code}
2015-04-19 21:49:57,533 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
completedContainer container=Container: [ContainerId: 
container_1429505171727_0001_02_000001, NodeId: SLAVE1:57794, NodeHttpAddress: 
SLAVE1:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { 
kind: ContainerToken, service: 172.16.100.92:57794 }, ] queue=default: 
capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, 
usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 
cluster=<memory:8192, vCores:8>
2015-04-19 21:49:57,533 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 
used=<memory:0, vCores:0> cluster=<memory:8192, vCores:8>
2015-04-19 21:49:57,533 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Re-sorting completed queue: root.default stats: default: capacity=1.0, 
absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, 
absoluteUsedCapacity=0.0, numApps=1, numContainers=0
2015-04-19 21:49:57,533 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Application attempt appattempt_1429505171727_0001_000002 released container 
container_1429505171727_0001_02_000001 on node: host: SLAVE1:57794 
#containers=0 available=8192 used=0 with event: FINISHED
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: NodeDataChanged with state:UserConnected for 
path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1429505171727_0001/appattempt_1429505171727_0001_000002
 for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
Unregistering app attempt : appattempt_1429505171727_0001_000002
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1429505171727_0001_000002 State change from FINAL_SAVING to FAILED
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
application application_1429505171727_0001 with final state: FAILED
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1429505171727_0001 State change from ACCEPTED to FINAL_SAVING
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating 
info for app: application_1429505171727_0001
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Application Attempt appattempt_1429505171727_0001_000002 is done. 
finalState=FAILED
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: 
Application application_1429505171727_0001 requests cleared
2015-04-19 21:49:57,580 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Application removed - appId: application_1429505171727_0001 user: SYSTEM queue: 
default #user-pending-applications: 0 #user-active-applications: 0 
#queue-pending-applications: 0 #queue-active-applications: 0
2015-04-19 21:49:57,611 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: NodeDataChanged with state:SyncConnected for 
path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1429505171727_0001 for 
Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in 
state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: 
STARTED
2015-04-19 21:49:57,611 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
application_1429505171727_0001 failed 2 times due to AM Container for 
appattempt_1429505171727_0001_000002 exited with  exitCode: -1073741515 due to: 
Exception from container-launch: ExitCodeException exitCode=-1073741515: 
ExitCodeException exitCode=-1073741515: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

        1 file(s) moved.

Container exited with a non-zero exit code -1073741515
.Failing this attempt.. Failing the application.
2015-04-19 21:49:57,627 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1429505171727_0001 State change from FINAL_SAVING to FAILED
2015-04-19 21:49:57,627 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Application removed - appId: application_1429505171727_0001 user: SYSTEM 
leaf-queue of parent: root #applications: 0
2015-04-19 21:49:57,627 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=SYSTEM   
OPERATION=Application Finished - Failed TARGET=RMAppManager     RESULT=FAILURE  
DESCRIPTION=App failed with state: FAILED       PERMISSIONS=Application 
application_1429505171727_0001 failed 2 times due to AM Container for 
appattempt_1429505171727_0001_000002 exited with  exitCode: -1073741515 due to: 
Exception from container-launch: ExitCodeException exitCode=-1073741515: 
ExitCodeException exitCode=-1073741515: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

        1 file(s) moved.

Container exited with a non-zero exit code -1073741515
.Failing this attempt.. Failing the application.        
APPID=application_1429505171727_0001
2015-04-19 21:49:57,627 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: 
appId=application_1429505171727_0001,name=OrderedWordCount,user=SYSTEM,queue=default,state=FAILED,trackingUrl=http://MASTER_NN1:8088/cluster/app/application_1429505171727_0001,appMasterHost=N/A,startTime=1429505386589,finishTime=1429505397580,finalStatus=FAILED
2015-04-19 21:49:58,580 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for 
port 8032: readAndProcess from client 172.16.100.XX threw exception 
[java.io.IOException: An existing connection was forcibly closed by the remote 
host]
{code}

The NodeManager log shows:
{code}
2015-04-20 10:19:59,365 INFO 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: 
launchContainer: [C:\Hadoop\bin\winutils.exe, task, create, 
container_1429505171727_0001_02_000001, cmd /c 
/tmp/hadoop-SLAVE1$/nm-local-dir/usercache/SYSTEM/appcache/application_1429505171727_0001/container_1429505171727_0001_02_000001/default_container_executor.cmd]
2015-04-20 10:19:59,436 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
from container container_1429505171727_0001_02_000001 is : -1073741515
2015-04-20 10:19:59,437 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
from container-launch with container ID: container_1429505171727_0001_02_000001 
and exit code: -1073741515
ExitCodeException exitCode=-1073741515: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2015-04-20 10:19:59,438 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:         1 file(s) 
moved.

2015-04-20 10:19:59,439 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Container exited with a non-zero exit code -1073741515
2015-04-20 10:19:59,439 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1429505171727_0001_02_000001 transitioned from RUNNING to 
EXITED_WITH_FAILURE
2015-04-20 10:19:59,440 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1429505171727_0001_02_000001
2015-04-20 10:19:59,480 INFO 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
absolute path : 
/tmp/hadoop-SLAVE1$/nm-local-dir/usercache/SYSTEM/appcache/application_1429505171727_0001/container_1429505171727_0001_02_000001
2015-04-20 10:19:59,480 WARN 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=SYSTEM       
OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  
DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE    
APPID=application_1429505171727_0001    
CONTAINERID=container_1429505171727_0001_02_000001
2015-04-20 10:19:59,481 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1429505171727_0001_02_000001 transitioned from 
EXITED_WITH_FAILURE to DONE
2015-04-20 10:19:59,481 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Removing container_1429505171727_0001_02_000001 from application 
application_1429505171727_0001
2015-04-20 10:19:59,481 INFO 
org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: ProcfsBasedProcessTree 
currently is supported only on Linux.
{code}
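
Because the log lines above are wrapped, a plain line-by-line search can miss the "Exit code from container ... is : ..." message. A minimal sketch that flattens the whitespace first and then pulls out container ID and exit code pairs; the `find_exit_codes` helper is hypothetical and the pattern is based only on the message format visible in this log.

```python
import re

# Pattern taken from the DefaultContainerExecutor WARN line in this report.
EXIT_RE = re.compile(r"Exit code from container (container_\w+) is : (-?\d+)")


def find_exit_codes(log_text):
    """Return (container_id, exit_code) pairs found in wrapped log text."""
    # Collapse line breaks and runs of spaces so wrapped messages match.
    flat = " ".join(log_text.split())
    return [(cid, int(code)) for cid, code in EXIT_RE.findall(flat)]


sample = """2015-04-20 10:19:59,436 WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code
from container container_1429505171727_0001_02_000001 is : -1073741515"""

for container_id, exit_code in find_exit_codes(sample):
    print(container_id, exit_code)
```
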

The problem might be that, while connecting to the NodeManager, it is unable to 
handshake with the ResourceManager.

If I try this on a single-node Hadoop cluster, it works correctly.

  was:
I successfully built Tez-0.6.0 against Hadoop-2.5.2.
Then I configured Tez-0.6.0 as described at http://tez.apache.org/install.html

I moved the Tez lib package to an HDFS location and updated my tez-site.xml:
{code:xml}
<property>
  <name>tez.lib.uris</name>
  <value>${fs.default.name}/apps/Tez/,${fs.default.name}/apps/Tez/lib/</value>
</property>
{code}
After that I tried the sample test for Tez:

_hadoop jar tez-examples-0.6.0.jar orderedwordcount <input> <output>_

But I get the following error while running this command:

{code}
Running OrderedWordCount
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/04/15 10:47:57 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.6.0, revision=${buildNumber}, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-04-15T01:13:02Z ]
15/04/15 10:48:00 INFO client.TezClient: Submitting DAG application with id: application_1429073725727_0005
15/04/15 10:48:00 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
15/04/15 10:48:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://HA-Cluster/apps/Tez/,hdfs://HA-Cluster/apps/Tez/lib/
15/04/15 10:48:01 INFO client.TezClient: Stage directory /tmp/app/tez/staging doesn't exist and is created
15/04/15 10:48:01 INFO client.TezClient: Tez system stage directory hdfs://HA-cluster/tmp/app/tez/staging/.tez/application_1429073725727_0005 doesn't exist and is created
15/04/15 10:48:02 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1429073725727_0005, dagName=OrderedWordCount
15/04/15 10:48:03 INFO impl.YarnClientImpl: Submitted application application_1429073725727_0005
15/04/15 10:48:03 INFO client.TezClient: The url to track the Tez AM: http://syncserver34:8088/proxy/application_1429073725727_0005/
15/04/15 10:48:03 INFO client.DAGClientImpl: Waiting for DAG to start running
15/04/15 10:48:09 INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
OrderedWordCount failed with diagnostics: [Application application_1429073725727_0005 failed 2 times due to AM Container for appattempt_1429073725727_0005_000002 exited with  exitCode: -1073741515 due to: Exception from container-launch: ExitCodeException exitCode=-1073741515:
ExitCodeException exitCode=-1073741515:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

        1 file(s) moved.

Container exited with a non-zero exit code -1073741515
.Failing this attempt.. Failing the application.]
{code}

The ResourceManager log shows:
{code}
15/04/15 12:56:15 ERROR scheduler.SchedulerApplicationAttempt: Error trying to assign container token and NM token to an allocated container container_1429082271173_0001_01_000001
java.lang.IllegalArgumentException: java.net.UnknownHostException: MasterNode
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
        at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:199)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:425)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.getAllocation(FiCaSchedulerApp.java:248)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocate(CapacityScheduler.java:736)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:816)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:809)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:649)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:761)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:742)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.UnknownHostException: MasterNode
        ... 19 more
{code}
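
The `UnknownHostException: MasterNode` above indicates that the ResourceManager host could not resolve that name. A minimal sketch for checking which cluster hostnames fail to resolve on a given machine; the `unresolved_hosts` helper is hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without touching DNS.

```python
import socket


def unresolved_hosts(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that the resolver cannot resolve.

    resolver is called as resolver(name, None) and is expected to raise
    socket.gaierror on failure, as socket.getaddrinfo does.
    """
    failed = []
    for name in hostnames:
        try:
            resolver(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed
```

Running `unresolved_hosts(["MasterNode", "SLAVE1"])` on each node would list the names that node cannot resolve; the result depends entirely on the local hosts file and DNS setup.
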
The problem might be that, while connecting to the NodeManager, it is unable to 
handshake with the ResourceManager.

If I try this on a single-node Hadoop cluster, it works correctly.


> Tez job failed due to AM Container-Launch failure at windows
> ------------------------------------------------------------
>
>                 Key: TEZ-2338
>                 URL: https://issues.apache.org/jira/browse/TEZ-2338
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Windows server 2012 and Windows-8
> Hadoop-2.5.2
> Java-1.7
>            Reporter: Kaveen Raajan
>
> I successfully built Tez-0.6.0 against Hadoop-2.5.2.
> Then I configured Tez-0.6.0 as described at http://tez.apache.org/install.html
> I moved the Tez lib package to an HDFS location and updated my tez-site.xml:
> {code:xml}
> <property>
>   <name>tez.lib.uris</name>
>   <value>${fs.default.name}/apps/Tez/,${fs.default.name}/apps/Tez/lib/</value>
> </property>
> {code}
> After that I tried the sample test for Tez:
> _hadoop jar tez-examples-0.6.0.jar orderedwordcount <input> <output>_
> But I get the following error while running this command.
> *Note:* I'm using a Hadoop High Availability setup.
> {code}
> Running OrderedWordCount
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/C:/Hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/C:/Tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 15/04/15 10:47:57 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.6.0, revision=${buildNumber}, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-04-15T01:13:02Z ]
> 15/04/15 10:48:00 INFO client.TezClient: Submitting DAG application with id: application_1429073725727_0005
> 15/04/15 10:48:00 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 15/04/15 10:48:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://HA-Cluster/apps/Tez/,hdfs://HA-Cluster/apps/Tez/lib/
> 15/04/15 10:48:01 INFO client.TezClient: Stage directory /tmp/app/tez/staging doesn't exist and is created
> 15/04/15 10:48:01 INFO client.TezClient: Tez system stage directory hdfs://HA-cluster/tmp/app/tez/staging/.tez/application_1429073725727_0005 doesn't exist and is created
> 15/04/15 10:48:02 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1429073725727_0005, dagName=OrderedWordCount
> 15/04/15 10:48:03 INFO impl.YarnClientImpl: Submitted application application_1429073725727_0005
> 15/04/15 10:48:03 INFO client.TezClient: The url to track the Tez AM: http://syncserver34:8088/proxy/application_1429073725727_0005/
> 15/04/15 10:48:03 INFO client.DAGClientImpl: Waiting for DAG to start running
> 15/04/15 10:48:09 INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
> OrderedWordCount failed with diagnostics: [Application application_1429073725727_0005 failed 2 times due to AM Container for appattempt_1429073725727_0005_000002 exited with  exitCode: -1073741515 due to: Exception from container-launch: ExitCodeException exitCode=-1073741515:
> ExitCodeException exitCode=-1073741515:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
>         1 file(s) moved.
> Container exited with a non-zero exit code -1073741515
> .Failing this attempt.. Failing the application.]
> {code}
> The ResourceManager log shows:
> {code}
> 2015-04-19 21:49:57,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> completedContainer container=Container: [ContainerId: 
> container_1429505171727_0001_02_000001, NodeId: SLAVE1:57794, 
> NodeHttpAddress: SLAVE1:8042, Resource: <memory:2048, vCores:1>, Priority: 0, 
> Token: Token { kind: ContainerToken, service: 172.16.100.92:57794 }, ] 
> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, 
> vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, 
> numContainers=0 cluster=<memory:8192, vCores:8>
> 2015-04-19 21:49:57,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 
> used=<memory:0, vCores:0> cluster=<memory:8192, vCores:8>
> 2015-04-19 21:49:57,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting completed queue: root.default stats: default: capacity=1.0, 
> absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=0
> 2015-04-19 21:49:57,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1429505171727_0001_000002 released container 
> container_1429505171727_0001_02_000001 on node: host: SLAVE1:57794 
> #containers=0 available=8192 used=0 with event: FINISHED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Watcher event type: NodeDataChanged with state:UserConnected for 
> path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1429505171727_0001/appattempt_1429505171727_0001_000002
>  for Service 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Unregistering app attempt : appattempt_1429505171727_0001_000002
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1429505171727_0001_000002 State change from FINAL_SAVING to FAILED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
> application application_1429505171727_0001 with final state: FAILED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1429505171727_0001 State change from ACCEPTED to FINAL_SAVING
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating 
> info for app: application_1429505171727_0001
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application Attempt appattempt_1429505171727_0001_000002 is done. 
> finalState=FAILED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: 
> Application application_1429505171727_0001 requests cleared
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Application removed - appId: application_1429505171727_0001 user: SYSTEM 
> queue: default #user-pending-applications: 0 #user-active-applications: 0 
> #queue-pending-applications: 0 #queue-active-applications: 0
> 2015-04-19 21:49:57,611 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Watcher event type: NodeDataChanged with state:SyncConnected for 
> path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1429505171727_0001 for 
> Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: 
> STARTED
> 2015-04-19 21:49:57,611 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1429505171727_0001 failed 2 times due to AM Container for 
> appattempt_1429505171727_0001_000002 exited with  exitCode: -1073741515 due 
> to: Exception from container-launch: ExitCodeException exitCode=-1073741515: 
> ExitCodeException exitCode=-1073741515: 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>       at org.apache.hadoop.util.Shell.run(Shell.java:455)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
>         1 file(s) moved.
> Container exited with a non-zero exit code -1073741515
> .Failing this attempt.. Failing the application.
> 2015-04-19 21:49:57,627 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1429505171727_0001 State change from FINAL_SAVING to FAILED
> 2015-04-19 21:49:57,627 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Application removed - appId: application_1429505171727_0001 user: SYSTEM 
> leaf-queue of parent: root #applications: 0
> 2015-04-19 21:49:57,627 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=SYSTEM 
> OPERATION=Application Finished - Failed TARGET=RMAppManager     
> RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED       
> PERMISSIONS=Application application_1429505171727_0001 failed 2 times due to 
> AM Container for appattempt_1429505171727_0001_000002 exited with  exitCode: 
> -1073741515 due to: Exception from container-launch: ExitCodeException 
> exitCode=-1073741515: 
> ExitCodeException exitCode=-1073741515: 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>       at org.apache.hadoop.util.Shell.run(Shell.java:455)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
>         1 file(s) moved.
> Container exited with a non-zero exit code -1073741515
> .Failing this attempt.. Failing the application.      
> APPID=application_1429505171727_0001
> 2015-04-19 21:49:57,627 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary:
>  
> appId=application_1429505171727_0001,name=OrderedWordCount,user=SYSTEM,queue=default,state=FAILED,trackingUrl=http://MASTER_NN1:8088/cluster/app/application_1429505171727_0001,appMasterHost=N/A,startTime=1429505386589,finishTime=1429505397580,finalStatus=FAILED
> 2015-04-19 21:49:58,580 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
> for port 8032: readAndProcess from client 172.16.100.XX threw exception 
> [java.io.IOException: An existing connection was forcibly closed by the 
> remote host]
> {code}
> In the NodeManager logs:
> {code}
> 2015-04-20 10:19:59,365 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: 
> launchContainer: [C:\Hadoop\bin\winutils.exe, task, create, 
> container_1429505171727_0001_02_000001, cmd /c 
> /tmp/hadoop-SLAVE1$/nm-local-dir/usercache/SYSTEM/appcache/application_1429505171727_0001/container_1429505171727_0001_02_000001/default_container_executor.cmd]
> 2015-04-20 10:19:59,436 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1429505171727_0001_02_000001 is : -1073741515
> 2015-04-20 10:19:59,437 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
> from container-launch with container ID: 
> container_1429505171727_0001_02_000001 and exit code: -1073741515
> ExitCodeException exitCode=-1073741515: 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>       at org.apache.hadoop.util.Shell.run(Shell.java:455)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
> 2015-04-20 10:19:59,438 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:         1 
> file(s) moved.
> 2015-04-20 10:19:59,439 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container exited with a non-zero exit code -1073741515
> 2015-04-20 10:19:59,439 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1429505171727_0001_02_000001 transitioned from RUNNING 
> to EXITED_WITH_FAILURE
> 2015-04-20 10:19:59,440 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1429505171727_0001_02_000001
> 2015-04-20 10:19:59,480 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : 
> /tmp/hadoop-SLAVE1$/nm-local-dir/usercache/SYSTEM/appcache/application_1429505171727_0001/container_1429505171727_0001_02_000001
> 2015-04-20 10:19:59,480 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=SYSTEM     
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl    
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1429505171727_0001    
> CONTAINERID=container_1429505171727_0001_02_000001
> 2015-04-20 10:19:59,481 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1429505171727_0001_02_000001 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2015-04-20 10:19:59,481 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Removing container_1429505171727_0001_02_000001 from application 
> application_1429505171727_0001
> 2015-04-20 10:19:59,481 INFO 
> org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: ProcfsBasedProcessTree 
> currently is supported only on Linux.
> {code}
> The problem might be that the handshake with the ResourceManager fails while 
> connecting to the NodeManager.
> When I run the same job on a single-node Hadoop cluster, it works correctly.
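
A note on the exit code in the logs above: -1073741515 is a signed 32-bit value, and Windows status codes are usually read as unsigned hex. The sketch below is not from the original report; the helper names `to_ntstatus_hex` and `describe_exit_code` are hypothetical and the lookup table is a deliberately tiny subset. It shows the conversion, which yields 0xC0000135 (STATUS_DLL_NOT_FOUND). That would suggest the container launch failed because a required DLL could not be loaded on the NodeManager host, rather than a handshake failure, but this is an assumption the logs alone do not confirm.

```python
# Minimal sketch (hypothetical helpers): decode a Windows process exit code
# that Hadoop logged as a signed 32-bit integer.

# Illustrative subset of NTSTATUS values; not a complete table.
KNOWN_STATUS = {
    "0xC0000135": "STATUS_DLL_NOT_FOUND",
    "0xC0000005": "STATUS_ACCESS_VIOLATION",
}


def to_ntstatus_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def describe_exit_code(exit_code: int) -> str:
    """Return the hex form plus a symbolic name when the code is in the table."""
    hex_code = to_ntstatus_hex(exit_code)
    name = KNOWN_STATUS.get(hex_code, "unknown status")
    return f"{hex_code} ({name})"


if __name__ == "__main__":
    # Exit code reported by DefaultContainerExecutor in the NodeManager log.
    print(describe_exit_code(-1073741515))
```

If the decoded status applies here, one thing to check would be whether `C:\Hadoop\bin\winutils.exe` and the JVM start cleanly from a command prompt on each slave node under the same account the NodeManager runs as.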



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
