[ https://issues.apache.org/jira/browse/FLINK-34589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824248#comment-17824248 ]
Xintong Song commented on FLINK-34589:
--------------------------------------
[~mapohl], thanks for reaching out.
Have you already seen errors thrown from the reconciliation, or did you just
notice that the safety net is missing without observing any errors so far? I'm
just trying to understand which errors could occur.
My understanding is that, ideally, we expect no exceptions to be thrown from
the reconciliation. If one is thrown, there may be failure scenarios that we
are not yet aware of. In that case, I'd be in favor of failing eagerly so that
we don't ignore the problem. Thus, I'd be in favor of option 1.
But it's not a strong preference, and I'd also be fine with option 2.
> FineGrainedSlotManager doesn't handle errors in the resource reconciliation
> step
> ---------------------------------------------------------------------------------
>
> Key: FLINK-34589
> URL: https://issues.apache.org/jira/browse/FLINK-34589
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0, 1.18.1, 1.20.0
> Reporter: Matthias Pohl
> Priority: Major
>
> I noticed during my work on FLINK-34427 that the reconciliation is scheduled
> periodically when starting the {{SlotManager}}. However, errors in this step
> are not handled. I see two options here:
> 1. Fail fatally because such an error might indicate a major issue with the
> RM backend.
> 2. Log the failure and continue the scheduled task even in case of an error.
> My understanding is that such an error only means we're unable to recreate
> TaskManagers, which should be a transient issue that could be resolved in the
> backend (YARN, k8s). That's why I would lean towards option 2.
> [~xtsong] WDYT?
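
To make the two options from the issue description concrete, here is a minimal Java sketch. It is not Flink code: the class name, both helper methods, the fatal-error callback, and the log callback are all hypothetical and only illustrate how a periodically executed reconciliation step could be wrapped under each option.

```java
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names are Flink APIs. */
public class ReconciliationErrorHandling {

    /**
     * Option 1: forward any error to a fatal-error handler.
     * The handler is an assumption standing in for "fail the process".
     */
    static Runnable failFatally(Runnable reconciliation, Consumer<Throwable> fatalErrorHandler) {
        return () -> {
            try {
                reconciliation.run();
            } catch (Throwable t) {
                fatalErrorHandler.accept(t);
            }
        };
    }

    /**
     * Option 2: log the failure and swallow it, so the caller can keep
     * running the task in later rounds. Only Exception is caught here;
     * an Error would still propagate.
     */
    static Runnable logAndContinue(Runnable reconciliation, Consumer<String> log) {
        return () -> {
            try {
                reconciliation.run();
            } catch (Exception e) {
                log.accept("Resource reconciliation failed, will retry in the next round: "
                        + e.getMessage());
            }
        };
    }
}
```

With option 1 the first failure surfaces immediately, which matches the "fail eagerly" preference in the comment above. With option 2 a failing round is only recorded and the next round is attempted as usual, which fits the assumption that the failure is transient.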
--
This message was sent by Atlassian Jira
(v8.20.10#820010)