[ https://issues.apache.org/jira/browse/SPARK-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546792#comment-15546792 ]
Dhruve Ashar commented on SPARK-17417:
--------------------------------------
[~srowen] As far as I understand the checkpointing mechanism in Spark core, the
recovery of an RDD from a checkpoint is limited to a single application attempt.
Spark Streaming documents that it can recover metadata/RDDs from checkpointed
data across application attempts. Please correct me if I have missed something
here. With this understanding it wouldn't be necessary to parse the code for
the old filename format, as recovery would be done using the same Spark jar
that was used to launch the application.
Also, why is it that we are not cleaning up the checkpoint directory on
sc.stop()?
> Fix # of partitions for RDD while checkpointing - Currently limited by
> 10000(%05d)
> ----------------------------------------------------------------------------------
>
> Key: SPARK-17417
> URL: https://issues.apache.org/jira/browse/SPARK-17417
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Dhruve Ashar
>
> Spark currently assumes the number of partitions to be less than 100000 and
> uses %05d padding.
> If we exceed this number, the sort logic in ReliableCheckpointRDD gets messed
> up and fails. This is because the part files are sorted and compared as
> strings. The resulting filename order is part-10000, part-100000, ... instead
> of part-10000, part-10001, ..., part-100000, and the job fails while
> reconstructing the checkpointed RDD.
> Possible solutions:
> - Bump the padding to allow more partitions, or
> - Sort the part files by extracting the numeric sub-portion of the name, and
> then verify the RDD
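The failure mode and the second proposed fix can be illustrated with a minimal
Python sketch (this is not Spark's actual code from ReliableCheckpointRDD; the
file names and sort key below are illustrative only). Once an index needs more
digits than the %05d padding provides, lexicographic order diverges from
numeric order:

```python
# Part-file names as produced by "%05d" padding; indices >= 100000
# overflow the 5-digit pad, so string comparison interleaves them
# with smaller indices (e.g. part-100000 sorts before part-99999).
names = ["part-%05d" % i for i in (9999, 10000, 99999, 100000)]

# Broken: plain string sort.
lexicographic = sorted(names)

# Proposed fix: sort by the numeric sub-portion of the file name.
numeric = sorted(names, key=lambda n: int(n.rsplit("-", 1)[1]))

print(lexicographic)  # ['part-09999', 'part-10000', 'part-100000', 'part-99999']
print(numeric)        # ['part-09999', 'part-10000', 'part-99999', 'part-100000']
```

Sorting on the extracted integer is padding-independent, so it stays correct
no matter how many partitions there are; bumping the padding width only moves
the limit further out.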
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)