Raghu Angadi created SPARK-45138:
------------------------------------
Summary: Define a new error class and apply for the case where
checkpointing state to DFS fails
Key: SPARK-45138
URL: https://issues.apache.org/jira/browse/SPARK-45138
Project: Spark
Issue Type: Task
Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Raghu Angadi
[From Neil Ramswamy]:
When there is an exception during storing state from DFS (Hadoop, object store,
etc), we just propagate the exception without the context. There is a context
if customers look into stack trace, but that is definitely not something we
want customers to look into by themselves.
For example, let’s say file system related exception happened during
checkpointing state. Since we just let the exception be bubbled up to the very
top without adding any context, it is quite confusing for customers to quickly
indicate whether there is an issue with source/sink (if source/sink data source
is based on file), or offset/commit log, or state load/checkpoint. Too many
operations can throw the same exception.
The ticket aims to wrap the exception during the commit of the state, to assign
error class properly. With assigning error class, we can classify the errors
which help us to determine what errors customers are struggling much.
StateStore.commit() is the entry point. Each state store provider has its own
implementation, but use the same error class across implementations, as we want
to categorize it as the same. Even better if we can put it to the higher-level
of caller, but would be OK to handle it in built-in impls.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]