Github user baluchicken commented on the issue:
https://github.com/apache/spark/pull/21067
    I ran some more tests on this. I think we can say that this change adds
resiliency to Spark batch jobs: just like on YARN, Spark will retry the job
from the beginning if an error happens.
    It can also add resiliency to Spark Streaming apps. I fully understand
your concerns, but anyone submitting a resilient Spark Streaming app will use
Spark's checkpointing feature. The checkpoint directory should be backed by
some kind of PersistentVolume, otherwise everything saved to it is lost on a
node failure. For the PVC, the accessMode should be ReadWriteOnce, because for
this amount of data it is much faster than the ReadWriteMany options.
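    To make that concrete, here is a minimal sketch (not the exact app I
tested) of a streaming driver that checkpoints to a PVC-backed directory, so a
replacement driver pod resumes from the checkpoint instead of starting over.
The mount path `/checkpoint`, the claim name, the socket source, and the
spark-submit flags in the comments are placeholders/assumptions, not part of
this PR:

```scala
// Hypothetical spark-submit flags for mounting the claim into the driver pod:
//   --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.ckpt.options.claimName=streaming-ckpt-pvc
//   --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.ckpt.mount.path=/checkpoint
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedStreamingApp {
  // Path where the ReadWriteOnce PVC is mounted inside the driver pod.
  val checkpointDir = "/checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedStreamingApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Metadata (and state, for stateful ops) is periodically written here,
    // so a restarted driver can pick up where the old one left off.
    ssc.checkpoint(checkpointDir)

    val lines = ssc.socketTextStream("some-host", 9999)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // getOrCreate recovers from the checkpoint if one exists (e.g. after the
    // Job Controller replaced a failed driver pod), otherwise builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```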
    My new tests used the same approach described above with one modification:
I enabled a checkpoint dir backed by a ReadWriteOnce PVC. ReadWriteOnce storage
can only be attached to one node. I thought Kubernetes would detach this volume
once the node became "NotReady", but something else happened: Kubernetes did
not detach the volume from the unknown node, so although the Job Controller
created a new driver pod to replace the Unknown one, the new pod remained in
Pending state because the required PVC was still attached to a different node.
Once the partitioned node became available again, the unknown old driver pod
was terminated, and the volume was detached and reattached to the new driver
pod, whose state then changed from Pending to Running.
    So I think there is no correctness problem here. We could maybe add a
warning to the documentation that correctness issues may arise if someone uses
a ReadWriteMany-backed checkpoint dir; otherwise, unless I am still missing
something, I think we are fine.