[
https://issues.apache.org/jira/browse/TEZ-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301499#comment-15301499
]
Feng Yuan commented on TEZ-3273:
--------------------------------
Here are the RM logs:
2016-05-26 10:33:31,765 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing
application with id application_1463493135662_90531
2016-05-26 10:33:31,765 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae
IP=192.168.44.40 OPERATION=Submit Application Request
TARGET=ClientRMService RESULT=SUCCESS APPID=application_1463493135662_90531
2016-05-26 10:33:31,765 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1463493135662_90531 State change from NEW to NEW_SAVING
2016-05-26 10:33:31,765 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing
info for app: application_1463493135662_90531
2016-05-26 10:33:31,772 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1463493135662_90531 State change from NEW_SAVING to SUBMITTED
2016-05-26 10:33:31,772 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Accepted application application_1463493135662_90531 from user: bae, in queue:
default, currently num of applications: 37
2016-05-26 10:33:31,772 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1463493135662_90531 State change from SUBMITTED to ACCEPTED
2016-05-26 10:33:39,102 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=xx-44p123-hadoop109.xx.com
app_id=application_1463493135662_90531
2016-05-26 10:33:39,130 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=xx-44p112-hadoop98.xx.com
app_id=application_1463493135662_90531
2016-05-26 10:33:39,134 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=xx-44p118-hadoop104.xx.com
app_id=application_1463493135662_90531
2016-05-26 10:33:39,155 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae
OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1463493135662_90531
CONTAINERID=container_1463493135662_90531_01_000004
2016-05-26 10:33:39,155 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Storing attempt: AppId: application_1463493135662_90531 AttemptId:
appattempt_1463493135662_90531_000001 MasterContainer: Container: [ContainerId:
container_1463493135662_90531_01_000004, NodeId:
xx-44p104-hadoop90.xx.com:59978, NodeHttpAddress:
xx-44p104-hadoop90.xx.com:8042, Resource: <memory:4096, vCores:1>, Priority: 0,
Token: Token { kind: ContainerToken, service: 192.168.44.104:59978 }, ]
2016-05-26 10:33:39,276 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Application application_1463493135662_90531 unreserved on node host:
xx-44p118-hadoop104.xx.com:4695 #containers=17 available=2560 used=28160,
currently has 2 at priority 0; currentReservation <memory:8192, vCores:2>
2016-05-26 10:33:40,106 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Application application_1463493135662_90531 unreserved on node host:
xx-44p123-hadoop109.xx.com:4335 #containers=15 available=2560 used=28160,
currently has 1 at priority 0; currentReservation <memory:4096, vCores:1>
2016-05-26 10:33:40,136 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Application application_1463493135662_90531 unreserved on node host:
xx-44p112-hadoop98.xx.com:42781 #containers=16 available=2560 used=28160,
currently has 0 at priority 0; currentReservation <memory:0, vCores:0>
2016-05-26 10:33:43,238 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae
IP=192.168.44.104 OPERATION=Register App Master
TARGET=ApplicationMasterService RESULT=SUCCESS
APPID=application_1463493135662_90531
APPATTEMPTID=appattempt_1463493135662_90531_000001
2016-05-26 10:33:43,238 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1463493135662_90531 State change from ACCEPTED to RUNNING
2016-05-26 10:33:48,748 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating
application application_1463493135662_90531 with final state: FINISHING
2016-05-26 10:33:48,748 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1463493135662_90531 State change from RUNNING to FINAL_SAVING
2016-05-26 10:33:48,750 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher
event type: NodeDataChanged with state:SyncConnected for
path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1463493135662_90531/appattempt_1463493135662_90531_000001
for Service
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2016-05-26 10:33:48,750 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating
info for app: application_1463493135662_90531
2016-05-26 10:33:48,754 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher
event type: NodeDataChanged with state:SyncConnected for
path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1463493135662_90531 for
Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in
state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
STARTED
2016-05-26 10:33:48,754 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1463493135662_90531 State change from FINAL_SAVING to FINISHING
2016-05-26 10:33:48,857 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
application_1463493135662_90531 unregistered successfully.
2016-05-26 10:33:49,050 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1463493135662_90531
CONTAINERID=container_1463493135662_90531_01_000004
2016-05-26 10:33:49,050 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1463493135662_90531 State change from FINISHING to FINISHED
2016-05-26 10:33:49,050 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae
OPERATION=Application Finished - Succeeded TARGET=RMAppManager
RESULT=SUCCESS APPID=application_1463493135662_90531
2016-05-26 10:33:49,050 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary:
appId=application_1463493135662_90531,name=BFD_JOB_TASK_1415_20160526103310,user=bae,queue=root.default,state=FINISHED,trackingUrl=http://xx-44p12-rm01:8088/proxy/application_1463493135662_90531/A,appMasterHost=xx-44p104-hadoop90.xx.com,startTime=1464230011765,finishTime=1464230028748,finalStatus=SUCCEEDED
2016-05-26 10:33:49,051 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo:
Application application_1463493135662_90531 requests cleared
There are many container reserve and unreserve events; a small counting sketch follows.
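For reference, a minimal sketch (plain JDK only; the RM log path is a hypothetical placeholder) that counts the reserve/unreserve churn for this application by matching the FSAppAttempt messages seen above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Counts FSAppAttempt reserve/unreserve log events for one application.
// The log path below is a hypothetical placeholder; point it at your RM log.
public class ReservationChurn {
    public static void main(String[] args) throws IOException {
        String appId = "application_1463493135662_90531";
        try (Stream<String> lines = Files.lines(Paths.get("/var/log/hadoop/yarn-rm.log"))) {
            long[] counts = new long[2];  // [0]=reservations, [1]=unreservations
            lines.filter(l -> l.contains(appId)).forEach(l -> {
                if (l.contains("Making reservation")) counts[0]++;
                if (l.contains("unreserved on node")) counts[1]++;
            });
            System.out.printf("reservations=%d unreservations=%d%n", counts[0], counts[1]);
        }
    }
}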
NM log for the AM container:
2016-05-26 10:33:39,161 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
Start request for container_1463493135662_90531_01_000004 by user bae
2016-05-26 10:33:39,161 INFO
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=bae
IP=192.168.44.12 OPERATION=Start Container Request
TARGET=ContainerManageImpl RESULT=SUCCESS
APPID=application_1463493135662_90531
CONTAINERID=container_1463493135662_90531_01_000004
2016-05-26 10:33:39,162 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
Adding container_1463493135662_90531_01_000004 to application
application_1463493135662_90531
2016-05-26 10:33:39,169 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1463493135662_90531_01_000004 transitioned from NEW to
LOCALIZING
2016-05-26 10:33:39,173 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Created localizer for container_1463493135662_90531_01_000004
2016-05-26 10:33:39,175 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Writing credentials to the nmPrivate file
/opt/data5/yarn_dir/local/nmPrivate/container_1463493135662_90531_01_000004.tokens.
Credentials list:
2016-05-26 10:33:39,342 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying
from
/opt/data5/yarn_dir/local/nmPrivate/container_1463493135662_90531_01_000004.tokens
to
/opt/data8/yarn_dir/local/usercache/bae/appcache/application_1463493135662_90531/container_1463493135662_90531_01_000004.tokens
2016-05-26 10:33:39,941 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1463493135662_90531_01_000004 transitioned from LOCALIZING
to LOCALIZED
2016-05-26 10:33:39,964 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1463493135662_90531_01_000004 transitioned from LOCALIZED
to RUNNING
2016-05-26 10:33:39,969 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
launchContainer: [bash,
/opt/data5/yarn_dir/local/usercache/bae/appcache/application_1463493135662_90531/container_1463493135662_90531_01_000004/default_container_executor.sh]
2016-05-26 10:33:42,883 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Starting resource-monitoring for container_1463493135662_90531_01_000004
2016-05-26 10:33:43,223 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 38928 for container-id
container_1463493135662_90531_01_000004: 223.1 MB of 4 GB physical memory used;
4.1 GB of 8.4 GB virtual memory used
2016-05-26 10:33:46,765 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 38928 for container-id
container_1463493135662_90531_01_000004: 227.7 MB of 4 GB physical memory used;
4.1 GB of 8.4 GB virtual memory used
2016-05-26 10:33:48,908 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Container container_1463493135662_90531_01_000004 succeeded
2016-05-26 10:33:48,908 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1463493135662_90531_01_000004 transitioned from RUNNING to
EXITED_WITH_SUCCESS
2016-05-26 10:33:48,908 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Cleaning up container container_1463493135662_90531_01_000004
.....
2016-05-26 10:33:48,929 INFO
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=bae
OPERATION=Container Finished - Succeeded TARGET=ContainerImpl
RESULT=SUCCESS APPID=application_1463493135662_90531
CONTAINERID=container_1463493135662_90531_01_000004
2016-05-26 10:33:48,929 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1463493135662_90531_01_000004 transitioned from
EXITED_WITH_SUCCESS to DONE
2016-05-26 10:33:48,929 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
Removing container_1463493135662_90531_01_000004 from application
application_1463493135662_90531
2016-05-26 10:33:48,929 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
Considering container container_1463493135662_90531_01_000004 for
log-aggregation
2016-05-26 10:33:49,056 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
Stopping container with container Id: container_1463493135662_90531_01_000004
2016-05-26 10:33:49,056 INFO
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=bae
IP=192.168.44.12 OPERATION=Stop Container Request
TARGET=ContainerManageImpl RESULT=SUCCESS
APPID=application_1463493135662_90531
CONTAINERID=container_1463493135662_90531_01_000004
2016-05-26 10:33:49,059 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed
completed containers from NM context: [container_1463493135662_90531_01_000004]
2016-05-26 10:33:49,073 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
Uploading logs for container container_1463493135662_90531_01_000004. Current
good log dirs are
/opt/data10/yarn_dir/local,/opt/data2/yarn_dir/local,/opt/data4/yarn_dir/local,/opt/data6/yarn_dir/local,/opt/data8/yarn_dir/local,/opt/data5/yarn_dir/local,/opt/data1/yarn_dir/local,/opt/data9/yarn_dir/local,/opt/data7/yarn_dir/local,/opt/data3/yarn_dir/local
.....
2016-05-26 10:33:49,939 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Stopping resource-monitoring for container_1463493135662_90531_01_000004
You can see the AM was launched and exited immediately.
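For what it's worth, here is a minimal sketch (assuming hadoop-yarn-client on the classpath, with the cluster timestamp and id taken from the application id above) that pulls the ApplicationReport and confirms how short the AM's lifetime was:

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Fetches the report for application_1463493135662_90531 and prints its
// lifetime; per the RMAppManager summary above this works out to ~17 seconds
// (finishTime=1464230028748 - startTime=1464230011765).
public class AmLifetime {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();
        try {
            ApplicationId appId = ApplicationId.newInstance(1463493135662L, 90531);
            ApplicationReport report = client.getApplicationReport(appId);
            long millis = report.getFinishTime() - report.getStartTime();
            System.out.println("state=" + report.getYarnApplicationState()
                    + " finalStatus=" + report.getFinalApplicationStatus()
                    + " lifetimeMs=" + millis);
        } finally {
            client.stop();
        }
    }
}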
> If one vertex has some failed tasks, the DAG will be stuck forever.
> ----------------------------------------------------------------
>
> Key: TEZ-3273
> URL: https://issues.apache.org/jira/browse/TEZ-3273
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
>             Environment: Hive 0.14, Hadoop 2.6
> Reporter: Feng Yuan
>
> Map 1: 145(+0,-1)/146 Reducer 2: 0/415
> Map 1: 145(+0,-1)/146 Reducer 2: 0/415
> Map 1: 145(+0,-1)/146 Reducer 2: 0/415
> Map 1: 145(+0,-1)/146 Reducer 2: 0/415
> Map 1: 145(+0,-1)/146 Reducer 2: 0/415
> Map 1: 145(+0,-1)/146 Reducer 2: 0/415
> Map 1: 145(+0,-1)/146 Reducer 2: 0/415
> Stuck forever.