[
https://issues.apache.org/jira/browse/SPARK-56720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56720:
-----------------------------------
Labels: pull-request-available (was: )
> Fail subsequent async log writes after a prior failure in async progress
> tracking
> ---------------------------------------------------------------------------------
>
> Key: SPARK-56720
> URL: https://issues.apache.org/jira/browse/SPARK-56720
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.4.0
> Reporter: Yuchen Liu
> Priority: Major
> Labels: pull-request-available
>
> When async progress tracking is enabled, offset and commit log writes are
> submitted to a single-threaded executor in {{AsyncOffsetSeqLog}} /
> {{AsyncCommitLog}}. If one async write task fails (e.g. an HDFS
> {{Permission denied}} or other {{IOException}}), follow-up tasks that are
> already queued, or that are queued before the main thread re-checks
> {{errorNotifier}} at the next batch boundary, still execute and may
> successfully persist files to durable storage. This produces two
> correctness/observability problems:
> # Gaps on durable storage. The offset log may be missing batch _N_ while
> batch _N+1_ is present, or a commit-log entry can be written without its
> corresponding offset-log entry. This violates the invariant that the commit
> log is a prefix of the offset log on disk.
> # Root cause is masked. {{ErrorNotifier.markError}} overwrites previously
> stored errors, so a later cascading failure (e.g.
> {{concurrentStreamLogUpdate}}) can replace the original
> {{Permission denied}}/{{IOException}} and surface as the user-visible
> {{StreamingQueryException}} cause.
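
The two problems above suggest a fail-fast pattern, sketched here in plain Java rather than Spark's Scala code. This is only an illustration under stated assumptions, not the actual patch: the class `FailFastAsyncLog`, the `Write` interface, and all method names are hypothetical. It shows a single-threaded executor whose queued tasks check a shared error slot before writing, and an error slot that keeps only the first failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an async log writer; not Spark's AsyncOffsetSeqLog.
class FailFastAsyncLog {
    // A write that may fail, e.g. with an IOException.
    interface Write {
        void run() throws Exception;
    }

    // First-error-wins slot: once set, later failures cannot overwrite it.
    private final AtomicReference<Throwable> firstError = new AtomicReference<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // Names of writes that completed; stands in for files on durable storage.
    private final List<String> persisted = new ArrayList<>();

    void submit(String name, Write write) {
        executor.execute(() -> {
            // Fail fast: skip tasks that were queued before the caller saw the error.
            if (firstError.get() != null) {
                return;
            }
            try {
                write.run();
                synchronized (persisted) {
                    persisted.add(name);
                }
            } catch (Throwable t) {
                // Record the failure only if no earlier failure is stored.
                firstError.compareAndSet(null, t);
            }
        });
    }

    // Drains the queue and waits for the worker thread to finish.
    void shutdownAndWait() {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("executor did not terminate");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    Throwable firstError() {
        return firstError.get();
    }

    List<String> persisted() {
        synchronized (persisted) {
            return new ArrayList<>(persisted);
        }
    }
}
```

In this sketch, if the write for batch 2 fails, a write for batch 3 that is already queued is skipped instead of being persisted, so no gap appears after the failed batch, and `firstError()` still returns the original exception.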