Github user ankurdave commented on the pull request:

    https://github.com/apache/spark/pull/1297#issuecomment-69475120
  
    @octavian-ganea IndexedRDD creates a new lineage entry for each operation. 
This enables fault tolerance but, as with other iterative Spark programs, can 
cause stack overflows when the lineage chain grows too long. There are two ways 
to mitigate this:
    
    1. 
[Checkpoint](http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing)
 the lineage periodically (every 50-100 iterations) using 
`sc.setCheckpointDir(...)` and `indexed.checkpoint()`. This will write each 
checkpointed RDD to disk in full. There isn't yet support for incremental 
checkpoints that only write the changes since the last checkpoint.
    
    2. Reduce the lineage length by batching operations using multiput or join. 
This also improves per-operation performance by amortizing the cost of each 
lineage entry across all of the batched elements.
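    Both mitigations might look roughly like the sketch below. This is only an 
illustration of the pattern, not code from this PR: it assumes an `IndexedRDD` 
companion constructor and a `multiput(kvs: Map[K, V])` method with these 
signatures, and the checkpoint directory and iteration counts are arbitrary 
placeholders.
    
    ```scala
    import org.apache.spark.SparkContext
    
    // Assumed API: IndexedRDD(...) constructor and multiput, as discussed in this PR.
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    
    var indexed = IndexedRDD(sc.parallelize((0L until 1000L).map(id => (id, 0))))
    for (i <- 1 to 500) {
      // Mitigation 2: batch many updates into a single multiput call,
      // so the whole batch adds only one lineage entry instead of one per key.
      val updates: Map[Long, Int] = (0L until 1000L).map(id => (id, i)).toMap
      indexed = indexed.multiput(updates)
    
      // Mitigation 1: periodically checkpoint to truncate the lineage chain.
      // Each checkpoint writes the full RDD to disk (no incremental support yet).
      if (i % 50 == 0) {
        indexed.checkpoint()
        indexed.count() // force an action so the checkpoint is actually written
      }
    }
    ```
    
    The `count()` after `checkpoint()` matters: checkpointing is lazy and only 
takes effect when an action next materializes the RDD.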
