Hi Iceberg devs,

I opened https://github.com/apache/iceberg/issues/16744 to propose improving commit failure messages when commit retries are exhausted.

Today, when a commit fails after exhausting commit.retry.* backoff, Iceberg rethrows the underlying CommitFailedException, but the final message does not make it clear why the retry loop stopped. That makes it hard for clients and operators to know which of these settings they should tune:

- commit.retry.num-retries
- commit.retry.total-timeout-ms
- commit.retry.min-wait-ms / commit.retry.max-wait-ms

I’d like to improve this by exposing whether commit retries stopped because the attempt budget was exhausted, the total retry timeout was exceeded, or both.

My current thinking is:

- classify retry exhaustion in Tasks
- preserve the original exception as the cause
- translate the exhaustion reason into commit-specific guidance at commit call sites

For example, the final commit exception could include guidance like “increase commit.retry.num-retries” when the attempt limit is reached, or “increase commit.retry.total-timeout-ms” when the total retry timeout is exceeded.

I’d appreciate feedback on whether this direction makes sense, especially around where the retry exhaustion classification should live and how much detail should be surfaced in the final exception message.

Thanks,
Joana
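
P.S. To make the discussion more concrete, here is a rough, self-contained Java sketch of the classification and translation steps. Everything in it is hypothetical: the class RetryExhaustionSketch, the Reason enum, and the classify/guidance/translate helpers are illustrative names, not existing Iceberg APIs, and the stop conditions are a simplification rather than a copy of what Tasks does today. A plain RuntimeException stands in for CommitFailedException so the snippet runs on its own.

```java
public class RetryExhaustionSketch {

  /** Hypothetical classification of why the retry loop gave up. */
  public enum Reason { ATTEMPTS_EXHAUSTED, TIMEOUT_EXCEEDED, BOTH }

  /**
   * Classifies retry exhaustion. Intended to be called only after the loop has stopped.
   * Assumption: the loop stops once retriesUsed reaches numRetries or elapsedMs
   * reaches totalTimeoutMs; the real stop conditions in Tasks may differ.
   */
  public static Reason classify(
      int retriesUsed, int numRetries, long elapsedMs, long totalTimeoutMs) {
    boolean attemptsExhausted = retriesUsed >= numRetries;
    boolean timeoutExceeded = elapsedMs >= totalTimeoutMs;
    if (attemptsExhausted && timeoutExceeded) {
      return Reason.BOTH;
    } else if (attemptsExhausted) {
      return Reason.ATTEMPTS_EXHAUSTED;
    } else if (timeoutExceeded) {
      return Reason.TIMEOUT_EXCEEDED;
    }
    throw new IllegalArgumentException("Retries were not exhausted");
  }

  /** Maps the generic exhaustion reason to commit-specific tuning guidance. */
  public static String guidance(Reason reason) {
    switch (reason) {
      case ATTEMPTS_EXHAUSTED:
        return "increase commit.retry.num-retries";
      case TIMEOUT_EXCEEDED:
        return "increase commit.retry.total-timeout-ms";
      default:
        return "increase commit.retry.num-retries and commit.retry.total-timeout-ms";
    }
  }

  /**
   * Builds the final exception at a commit call site, keeping the original as the cause.
   * RuntimeException is a stand-in for CommitFailedException in this sketch.
   */
  public static RuntimeException translate(RuntimeException original, Reason reason) {
    String message =
        original.getMessage() + " (commit retries exhausted: " + guidance(reason) + ")";
    return new RuntimeException(message, original);
  }

  public static void main(String[] args) {
    // Example values: 4 retries allowed and used, well within a 30-minute timeout.
    RuntimeException original = new RuntimeException("Commit conflict on table metadata");
    Reason reason = classify(4, 4, 2_500L, 1_800_000L);
    RuntimeException translated = translate(original, reason);
    System.out.println(reason);
    System.out.println(translated.getMessage());
  }
}
```

Keeping classify free of any commit wording is deliberate in this sketch: Tasks would only report the generic reason, and each commit call site would decide how to phrase the guidance.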
