Checkpoint directory structure

2015-09-23 Thread Bin Wang
I found that the checkpoint directory structure looks like this:

-rw-r--r--   1 root root 134820 2015-09-23 16:55 /user/root/checkpoint/checkpoint-144299850
-rw-r--r--   1 root root 134768 2015-09-23 17:00 /user/root/checkpoint/checkpoint-144299880
-rw-r--r--   1 root root 134895 2015-09-23 17:05 /user/root/checkpoint/checkpoint-144299910
-rw-r--r--   1 root root 134899 2015-09-23 17:10 /user/root/checkpoint/checkpoint-144299940
-rw-r--r--   1 root root 134913 2015-09-23 17:15 /user/root/checkpoint/checkpoint-144299970
-rw-r--r--   1 root root 134928 2015-09-23 17:20 /user/root/checkpoint/checkpoint-14430
-rw-r--r--   1 root root 134987 2015-09-23 17:25 /user/root/checkpoint/checkpoint-144300030
-rw-r--r--   1 root root 134944 2015-09-23 17:30 /user/root/checkpoint/checkpoint-144300060
-rw-r--r--   1 root root 134956 2015-09-23 17:35 /user/root/checkpoint/checkpoint-144300090
-rw-r--r--   1 root root 135244 2015-09-23 17:40 /user/root/checkpoint/checkpoint-144300120
drwxr-xr-x   - root root  0 2015-09-23 18:48 /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2
drwxr-xr-x   - root root  0 2015-09-23 17:44 /user/root/checkpoint/receivedBlockMetadata


I restarted Spark and it reads from
/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2. But it seems
that some RDDs are missing from the data there, so it is not able to
recover. I also found other entries in checkpoint/, such as
/user/root/checkpoint/checkpoint-144300120. What are they used for?
Can I recover my data from them?
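
For anyone reading the listing, here is a rough sketch of how the three kinds of entries could be told apart by name alone. The role labels reflect my understanding of how Spark Streaming 1.x lays out a checkpoint directory (checkpoint-<time> files hold the streaming metadata, the UUID-named directory holds checkpointed RDD data, receivedBlockMetadata holds the receiver write-ahead log), so treat them as assumptions rather than a statement from the Spark docs. The helper classify_entry is made up for illustration and is not a Spark API.

```python
import re

# Hypothetical helper, not part of Spark: guesses the role of an entry in a
# Spark Streaming checkpoint directory from its name only.
METADATA_RE = re.compile(r"^checkpoint-\d+$")
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def classify_entry(path):
    """Return an assumed role for one checkpoint directory entry."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    if METADATA_RE.match(name):
        # Assumed: serialized streaming metadata for one checkpoint time.
        return "streaming metadata checkpoint"
    if UUID_RE.match(name):
        # Assumed: per-application directory holding checkpointed RDD data.
        return "RDD checkpoint data"
    if name == "receivedBlockMetadata":
        # Assumed: write-ahead log of received block metadata.
        return "received-block write-ahead log"
    return "unknown"


if __name__ == "__main__":
    for entry in (
        "/user/root/checkpoint/checkpoint-144300120",
        "/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2",
        "/user/root/checkpoint/receivedBlockMetadata",
    ):
        print(entry, "->", classify_entry(entry))
```

If that reading is right, the checkpoint-<time> files alone would not bring back lost RDD data, since they would only describe the job and not contain the RDD contents.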


Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
Log Type: stderr
Log Upload Time: Wed Sep 23 17:47:51 +0800 2015
Log Length: 55303
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/yarn/nm/usercache/root/filecache/6753/spark-assembly-1.5.1-SNAPSHOT-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/09/23 17:47:28 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
15/09/23 17:47:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1440495451668_0297_01
15/09/23 17:47:31 INFO spark.SecurityManager: Changing view acls to: yarn,root
15/09/23 17:47:31 INFO spark.SecurityManager: Changing modify acls to: yarn,root
15/09/23 17:47:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, root); users with modify permissions: Set(yarn, root)
15/09/23 17:47:32 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
15/09/23 17:47:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization
15/09/23 17:47:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization ... 
15/09/23 17:47:32 INFO streaming.CheckpointReader: Checkpoint files found: hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144300120,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144300090,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144300060,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144300030,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-14430,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144299970,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144299940,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144299910,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144299880,hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144299850
15/09/23 17:47:32 INFO streaming.CheckpointReader: Attempting to load checkpoint from file hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/checkpoint-144300120
15/09/23 17:47:33 INFO streaming.Checkpoint: Checkpoint for time 144300120 ms validated
15/09/23 17:47:33 INFO streaming.Checkpoint
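
The CheckpointReader lines in the log above list the checkpoint files newest first and then attempt the first one. A minimal sketch of that ordering, assuming the numeric suffix of each file name is the checkpoint time (the "Checkpoint for time ... ms validated" line suggests so). The helper newest_first is hypothetical and only mimics what the log shows; it is not Spark's own code.

```python
def newest_first(paths):
    """Order checkpoint-<time> paths so the most recent time comes first."""

    def checkpoint_time(path):
        # Assumption: everything after the last "checkpoint-" is an integer time.
        return int(path.rsplit("checkpoint-", 1)[1])

    return sorted(paths, key=checkpoint_time, reverse=True)


if __name__ == "__main__":
    base = "hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/"
    found = [
        base + "checkpoint-144299970",
        base + "checkpoint-144300120",
        base + "checkpoint-144300030",
    ]
    # The first element is the file a newest-first reader would try first.
    for path in newest_first(found):
        print(path)
```

Trying the newest file first means recovery resumes from the latest recorded batch, with older files available as fallbacks if the newest one cannot be read.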

Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
BTW, I just killed the application and restarted it. The application then
could not recover from the checkpoint because some RDDs had been lost. So I
wonder: if the application itself fails, might it likewise be unable to
recover from the checkpoint?



Re: Checkpoint directory structure

2015-09-23 Thread Tathagata Das
Could you provide the logs on when and how you are seeing this error?
