viirya commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846358670
> I'm not sure about the scenario of leveraging PVC as checkpoint location -
> at least that sounds to me as beyond the support of checkpoint in Structured
> Streaming.
I agree on this, and yes, this is the current status. That said, we are
going to propose a new approach to support checkpoint in Structured Streaming.
Unfortunately, because scheduling is bound to stateful tasks (i.e.
state store locations), we cannot achieve the goal without touching other
modules, like core.
> I'm more likely novice on cloud/k8s, but from the common sense, I guess
> the actual storage of PVC should be still a sort of network storage to be
> resilient on "physical node down". I'm wondering how much benefits PVC approach
> gives compared to the existing approach as just directly use remote
> fault-tolerant file system. The benefits should be clear to cope with
> additional complexity.
Technically, a PVC is a kind of abstraction over the volume mounted in the
container running the executor. It could be local storage on the k8s nodes. It
depends on where the PVC is bound.
HDFS has become a bottleneck for our streaming jobs. The throughput to HDFS
and the load that the number of files puts on the name nodes are serious issues
when using it as the checkpoint destination for heavy streaming jobs at scale.
Using PVC as the checkpoint location could be a huge relief for the load on HDFS.
There are also other benefits, like better latency and a simplified streaming
architecture. Personally, I think these are enough benefits to motivate our
proposal.