I have a Spark Streaming app (1.5.2, compiled for Hadoop 2.6) that occasionally has trouble writing its checkpoint files. It runs on YARN (yarn-cluster mode) as user mapred.
What I see in my streaming app logs is:

15/12/15 00:14:08 server51: /mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15 00:14:08 INFO hdfs.DFSClient: Could not complete /user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00022-attempt-0 retrying...
15/12/15 00:14:20 server51: /mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15 00:14:20 INFO hdfs.DFSClient: Could not complete /user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00025-attempt-0 retrying...
15/12/15 00:14:24 server51: /mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15 00:14:24 INFO hdfs.DFSClient: Could not complete /user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00025-attempt-0 retrying...
15/12/15 00:14:50 server51: /mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15 00:14:50 INFO hdfs.DFSClient: Could not complete /user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00046-attempt-0 retrying...
15/12/15 00:14:55 server51: /mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15 00:14:55 INFO hdfs.DFSClient: Could not complete /user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00046-attempt-0 retrying...

(server51 = xx.xx.xx.199)

In the namenode logs during that same window, I see a bunch of these messages:

[2015-12-15T00:14:08.433Z,INFO,org.apache.hadoop.hdfs.server.namenode.FSNamesystem:,BLOCK* checkFileProgress: blk_2161610531_1087887542{blockUCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-45109dfc-c3e5-4fd7-a6de-9cec1db8e301:NORMAL:xx.xx.xx.138:50010|RBW], ReplicaUnderConstruction[[DISK]DS-61c84f08-e038-442d-8f64-238a9bbd8803:NORMAL:xx.xx.xx.199:50010|RBW], ReplicaUnderConstruction[[DISK]DS-bc8b67eb-2715-4300-bc44-7ba5b54e80c8:NORMAL:xx.xx.xx.119:50010|RBW]]} has not reached minimal replication 1]

If I filter those messages out, the remaining messages during that time frame from xx.xx.xx.199 are:

[2015-12-15T00:00:08.449Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 17 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from xx.xx.xx.199:46602 Call#2746 Retry#0]
[2015-12-15T00:10:29.609Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 47 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from xx.xx.xx.199:59576 Call#38454 Retry#0: org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=EXECUTE, inode="/app-logs/mapred/logs":mapred:hdfs:drwxrwx---]
[2015-12-15T00:13:24.556Z,INFO,org.apache.hadoop.ipc.Server:,Socket Reader #1 for port 8020: readAndProcess from client xx.xx.xx.199 threw exception [java.io.IOException: Connection reset by peer]]
[2015-12-15T00:13:24.673Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 75 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from xx.xx.xx.199:59984 Call#4765 Retry#0]
[2015-12-15T00:13:24.683Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 70 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from xx.xx.xx.199:59984 Call#4766 Retry#0]
[2015-12-15T00:13:24.699Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 64 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from xx.xx.xx.199:59984 Call#4767 Retry#0]
[2015-12-15T00:13:24.721Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler 57 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from xx.xx.xx.199:59984 Call#4768 Retry#0]
[2015-12-15T00:13:42.641Z,INFO,org.apache.hadoop.ipc.Server:,Socket Reader #1 for port 8020: readAndProcess from client xx.xx.xx.199 threw exception [java.io.IOException: Connection reset by peer]]

The user mapred can see the directory, but it is empty (e.g., hdfs dfs -ls /app-logs/mapred/logs/application_1450104983109_0273/ returns nothing). The message about execute permission being denied seems off; I am wondering if some configuration is dialed down too low and the permission error is just a red herring.

Has anyone else run into a problem where checkpointing hits IOExceptions when running as a YARN app?

Thanks,
Robert
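For what it's worth, the "Could not complete ... retrying" lines come from the client polling the namenode to close the file: the namenode refuses to complete it while the last block (blockUCState=COMMITTED above) has not reached minimal replication, and the client retries. A minimal sketch of that poll-and-retry shape, assuming invented names (complete_with_retry, try_complete) and not the real DFSClient RPC code:

```python
import time

def complete_with_retry(try_complete, retries=5, sleep_s=0.4):
    """Poll until the namenode reports the file can be completed.

    Sketch only: the real DFSClient issues a complete() RPC and has its
    own backoff policy; these parameter names are invented.
    """
    for _ in range(retries):
        if try_complete():
            return True
        # corresponds to the "Could not complete ... retrying" INFO line
        print("Could not complete file, retrying...")
        time.sleep(sleep_s)
    return False

# Simulate a last block that reaches minimal replication on the 3rd poll.
state = {"polls": 0}
def fake_complete():
    state["polls"] += 1
    return state["polls"] >= 3

print(complete_with_retry(fake_complete, sleep_s=0.0))  # True after two retries
```

If the block never reaches minimal replication within the retry budget, the loop gives up and the write surfaces as an IOException, which matches the symptom described above.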