viirya commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846358670


   > I'm not sure about the scenario of leveraging PVC as checkpoint location - 
at least that sounds to me as beyond the support of checkpoint in Structured 
Streaming.
   
   I agree with this, and yes, this is the current status. That said, we are going to propose a new approach to support this kind of checkpointing in Structured Streaming. Unfortunately, because scheduling is bound to stateful tasks (i.e. state store locations), we cannot achieve the goal without touching other modules, like core.
    
   > I'm more likely novice on cloud/k8s, but from the common sense, I guess 
the actual storage of PVC should be still a sort of network storage to be 
resilient on "physical node down". I'm wondering how much benefits PVC approach 
gives compared to the existing approach as just directly use remote 
fault-tolerant file system. The benefits should be clear to cope with 
additional complexity.
   
   Technically, a PVC is an abstract way of looking at the volume mounted on the container running the executor. It could be backed by local storage on the k8s nodes; it depends on where the PVC is bound.
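   As a very rough sketch (not part of this PR; the volume name, storage class, size and mount path below are made up for illustration), a PVC can already be attached to executors through the existing Spark-on-K8s volume configs, e.g.:
   
       # illustrative only: "checkpoint-vol" and "/checkpoint" are hypothetical names
       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoint-vol.options.claimName=OnDemand
       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoint-vol.options.storageClass=standard
       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoint-vol.options.sizeLimit=100Gi
       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoint-vol.mount.path=/checkpoint
       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoint-vol.mount.readOnly=false
   
   Whether that volume ends up on node-local disks or on networked storage depends entirely on the storage class the claim is bound to, which is the point above.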
   
   HDFS has become a bottleneck for our streaming jobs. The write throughput to HDFS and the load that the sheer number of files puts on the NameNodes are serious issues when using it as the checkpoint destination for heavy streaming jobs at scale. Using PVCs for checkpoints could be a huge relief for the HDFS load. There are other benefits as well, such as better latency and a simplified streaming architecture. Personally I think these are enough benefits to motivate our proposal.
   
   
   

