I have a Spark Streaming app (1.5.2, compiled for Hadoop 2.6) that occasionally 
has problems writing its checkpoint files.  This is a YARN app (yarn-cluster 
mode) running as user mapred.
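
For context, checkpointing is wired up the standard way, roughly like this (a 
simplified sketch; the app name, batch interval, and exact checkpoint path 
below are placeholders rather than the real values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder checkpoint root; the real one is under
    // /user/hadoop/yarn-app/session_svc/checkpoint/<uuid> as in the logs below.
    val checkpointDir = "hdfs:///user/hadoop/yarn-app/session_svc/checkpoint"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("session_svc")  // placeholder app name
      val ssc = new StreamingContext(conf, Seconds(30))     // placeholder batch interval
      ssc.checkpoint(checkpointDir)  // metadata and RDD checkpoints go to this HDFS dir
      // DStream setup elided
      ssc
    }

    // Standard recover-or-create pattern for a checkpointed streaming app.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()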

What I see in my streaming app logs is:
15/12/15 00:14:08   server51: 
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
 00:14:08 INFO hdfs.DFSClient: Could not complete 
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00022-attempt-0
 retrying...
15/12/15 00:14:20   server51: 
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
 00:14:20 INFO hdfs.DFSClient: Could not complete 
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00025-attempt-0
 retrying...
15/12/15 00:14:24   server51: 
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
 00:14:24 INFO hdfs.DFSClient: Could not complete 
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00025-attempt-0
 retrying...
15/12/15 00:14:50   server51: 
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
 00:14:50 INFO hdfs.DFSClient: Could not complete 
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00046-attempt-0
 retrying...
15/12/15 00:14:55   server51: 
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
 00:14:55 INFO hdfs.DFSClient: Could not complete 
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00046-attempt-0
 retrying...


server51 = xx.xx.xx.199

In the NameNode logs during that same time frame, I see a bunch of these messages:

[2015-12-15T00:14:08.433Z,INFO,org.apache.hadoop.hdfs.server.namenode.FSNamesystem:,BLOCK*
 checkFileProgress: blk_2161610531_1087887542{blockUCState=COMMITTED, 
truncateBlock=null, primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[[DISK]DS-45109dfc-c3e5-4fd7-a6de-9cec1db8e301:NORMAL:xx.xx.xx.138:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-61c84f08-e038-442d-8f64-238a9bbd8803:NORMAL:xx.xx.xx.199:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-bc8b67eb-2715-4300-bc44-7ba5b54e80c8:NORMAL:xx.xx.xx.119:50010|RBW]]}
 has not reached minimal replication 1]
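
My reading of that message is that the NameNode will not complete() the file 
while its last block is still in COMMITTED state, i.e. no DataNode has 
reported it yet, so the DFSClient just keeps retrying.  If I end up tuning 
anything, it would be the client-side retry count, roughly like this (a 
sketch only; that dfs.client.block.write.locateFollowingBlock.retries is the 
knob behind this retry loop is my assumption, and the value is illustrative):

    import org.apache.spark.SparkConf

    // Spark copies spark.hadoop.* entries into the Hadoop Configuration it
    // hands to the HDFS client, so this should reach the DFSClient in the
    // executors.  Property choice and value are assumptions, not a known fix.
    val conf = new SparkConf()
      .set("spark.hadoop.dfs.client.block.write.locateFollowingBlock.retries", "10")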

If I filter those messages out, the remaining messages during that time frame 
from xx.xx.xx.199 are:

[2015-12-15T00:00:08.449Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 
17 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease 
from xx.xx.xx.199:46602 Call#2746 Retry#0]
[2015-12-15T00:10:29.609Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 
47 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 
xx.xx.xx.199:59576 Call#38454 Retry#0: 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=yarn, access=EXECUTE, inode="/app-logs/mapred/logs":mapred:hdfs:drwxrwx---]
[2015-12-15T00:13:24.556Z,INFO,org.apache.hadoop.ipc.Server:,Socket Reader #1 
for port 8020: readAndProcess from client xx.xx.xx.199 threw exception 
[java.io.IOException: Connection reset by peer]]
[2015-12-15T00:13:24.673Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 
75 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo 
from xx.xx.xx.199:59984 Call#4765 Retry#0]
[2015-12-15T00:13:24.683Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 
70 on 8020: skipped 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 
xx.xx.xx.199:59984 Call#4766 Retry#0]
[2015-12-15T00:13:24.699Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 
64 on 8020: skipped 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 
xx.xx.xx.199:59984 Call#4767 Retry#0]
[2015-12-15T00:13:24.721Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 
57 on 8020: skipped 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 
xx.xx.xx.199:59984 Call#4768 Retry#0]
[2015-12-15T00:13:42.641Z,INFO,org.apache.hadoop.ipc.Server:,Socket Reader #1 
for port 8020: readAndProcess from client xx.xx.xx.199 threw exception 
[java.io.IOException: Connection reset by peer]]

The user mapred can see that directory, but it is empty (e.g., hdfs dfs -ls 
/app-logs/mapred/logs/application_1450104983109_0273/ returns nothing).
The execute-permission-denied message seems off: it is user yarn being denied 
on /app-logs/mapred/logs (mapred:hdfs, drwxrwx---), which looks like the YARN 
log-aggregation path rather than the checkpoint path.  I am wondering if some 
configuration is dialed down too low and the permission error is just a red 
herring.
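
If it helps anyone reproduce what I am seeing, the FileSystem-API equivalent 
of the hdfs dfs -ls check would be roughly this (a sketch; the two paths are 
just the checkpoint root and the app-log directory taken from the logs above):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Print owner, group, and permission bits as seen by whoever runs this
    // (mapred in my case), for both directories involved.
    val fs = FileSystem.get(new Configuration())
    Seq(new Path("/user/hadoop/yarn-app/session_svc/checkpoint"),
        new Path("/app-logs/mapred/logs")).foreach { p =>
      val st = fs.getFileStatus(p)
      println(s"${st.getPermission} ${st.getOwner}:${st.getGroup} ${st.getPath}")
    }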


Has anyone else run into a problem where writing the checkpoint hits 
IOExceptions when running as a YARN app?

Thanks,
Robert
