GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/9988
[SPARK-11932][STREAMING] Partition previous state RDD if partitioner not
present
TrackStateRDDs generated by trackStateByKey expect the previous batch's
TrackStateRDDs to have a partitioner. However, when recovering from DStream
checkpoints, the RDDs recovered from RDD checkpoints do not have a partitioner
attached to them, because RDD checkpoints do not preserve the partitioner
(SPARK-12004).
While #9983 solves SPARK-12004 by preserving the partitioner through RDD
checkpoints, saving or recovering the partitioner may still fail in some cases.
To be resilient, this PR repartitions the previous state RDD if no partitioner
is detected on it.
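
The fallback described above can be sketched roughly as follows. This is only
an illustration built on the public RDD API (`RDD.partitioner` and
`partitionBy`), not the code changed in this PR. The object name
`StatePartitioning`, the helper `ensurePartitioned`, and the use of
`sc.parallelize` to stand in for a state RDD recovered without a partitioner
are all assumptions made for the example.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object StatePartitioning {

  // Hypothetical helper: return the RDD unchanged if it already carries the
  // expected partitioner, otherwise shuffle it into the expected layout.
  def ensurePartitioned[K: ClassTag, V: ClassTag](
      rdd: RDD[(K, V)],
      expected: Partitioner): RDD[(K, V)] = {
    if (rdd.partitioner == Some(expected)) rdd
    else rdd.partitionBy(expected)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ensure-partitioned")
    val sc = new SparkContext(conf)
    try {
      val partitioner = new HashPartitioner(4)

      // An RDD created by parallelize has no partitioner; it stands in here
      // for a previous state RDD recovered from a checkpoint.
      val recovered = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      assert(recovered.partitioner.isEmpty)

      // After the fallback, the partitioner is present.
      val fixed = ensurePartitioned(recovered, partitioner)
      assert(fixed.partitioner == Some(partitioner))

      // An RDD that already has the expected partitioner is returned as-is.
      assert(ensurePartitioned(fixed, partitioner) eq fixed)
    } finally {
      sc.stop()
    }
  }
}
```

Repartitioning costs a shuffle, so in a sketch like this it only runs when the
partitioner is actually missing or different; the common path returns the
input RDD untouched.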
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark SPARK-11932
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9988.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9988
----
commit 0c5fe55b1ff8da4cca28b91860afbcfbd28e7422
Author: Tathagata Das <[email protected]>
Date: 2015-11-26T01:54:44Z
Partition previous state RDD if partitioner not present
----