[
https://issues.apache.org/jira/browse/FLINK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yanfei Lei updated FLINK-36884:
-------------------------------
Component/s: Runtime / REST
> GCS 503 Error Codes for `flink-checkpoints/<id>/shared/` file after upload
> complete
> -----------------------------------------------------------------------------------
>
> Key: FLINK-36884
> URL: https://issues.apache.org/jira/browse/FLINK-36884
> Project: Flink
> Issue Type: Bug
> Components: FileSystems, Runtime / Checkpointing, Runtime / REST
> Affects Versions: 1.18.0
> Environment: We are using Flink 1.18.0 with the gs-plugin.
> It is a rare bug, but one we have noticed multiple times.
> Reporter: Ryan van Huuksloot
> Priority: Minor
> Attachments: Screenshot 2024-12-10 at 1.46.06 PM.png
>
>
> We had a Flink pipeline that suddenly started to fail checkpointing on a single
> subtask [Image 1]. It does not block checkpointing for the rest of the DAG, so
> the checkpoint barriers continue to flow.
> We investigated the issue and found that the checkpoint kept retrying the same
> write; it retried uploading the file thousands of times. The issue persisted
> across checkpoints and savepoints, but it only ever failed for one specific
> file.
> An example log:
>
> {code:java}
> Dec 10, 2024 6:06:05 PM
> com.google.cloud.hadoop.util.RetryHttpInitializer$LoggingResponseHandler
> handleResponse
> INFO: Encountered status code 503 when sending PUT request to URL
> 'https://storage.googleapis.com/upload/storage/v1/b/<bucket>/o?ifGenerationMatch=0&name=flink-checkpoints/2394318276860454f7b6d1689f770796/shared/7d6bb60b-e0cf-4873-afc1-f2d785a4418e&uploadType=resumable&upload_id=<upload_id>'.
> Delegating to response handler for possible retry.
> ...{code}
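>
> For context, the ifGenerationMatch=0 parameter in that upload URL is a
> precondition: GCS should only create the object if nothing with that name
> exists yet (a failed precondition would normally surface as 412 Precondition
> Failed rather than 503). Purely as an illustration of that precondition, and
> not the gcs-connector's actual code path, the equivalent request via the
> google-cloud-storage Java client would look roughly like this (the bucket name
> is a placeholder):
> {code:java}
> import com.google.cloud.storage.BlobId;
> import com.google.cloud.storage.BlobInfo;
> import com.google.cloud.storage.Storage;
> import com.google.cloud.storage.StorageOptions;
>
> public class PreconditionedCreate {
>     public static void main(String[] args) {
>         // Placeholder bucket; the object path is the one from the log above.
>         String bucket = "my-bucket";
>         String object = "flink-checkpoints/2394318276860454f7b6d1689f770796/shared/"
>                 + "7d6bb60b-e0cf-4873-afc1-f2d785a4418e";
>
>         Storage storage = StorageOptions.getDefaultInstance().getService();
>         BlobInfo target = BlobInfo.newBuilder(BlobId.of(bucket, object)).build();
>         // doesNotExist() translates to the ifGenerationMatch=0 precondition: the
>         // create is rejected if a live object with this name already exists.
>         storage.create(target, new byte[0], Storage.BlobTargetOption.doesNotExist());
>     }
> }
> {code}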
>
> {*}It is important to note that the file was in fact there. I am not sure
> whether it was complete, but it was not an .inprogress file, so I believe it
> was complete{*}.
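>
> Given the ifGenerationMatch=0 precondition above, here is a minimal sketch of
> how the object's presence and generation can be double-checked (assuming the
> google-cloud-storage Java client; the bucket name is a placeholder):
> {code:java}
> import com.google.cloud.storage.Blob;
> import com.google.cloud.storage.Storage;
> import com.google.cloud.storage.StorageOptions;
>
> public class CheckSharedObject {
>     public static void main(String[] args) {
>         // Placeholder bucket; the object path is the one from the failing log line.
>         String bucket = "my-bucket";
>         String object = "flink-checkpoints/2394318276860454f7b6d1689f770796/shared/"
>                 + "7d6bb60b-e0cf-4873-afc1-f2d785a4418e";
>
>         Storage storage = StorageOptions.getDefaultInstance().getService();
>         Blob blob = storage.get(bucket, object);
>         if (blob == null) {
>             System.out.println("object does not exist");
>         } else {
>             // A non-zero generation means a live object is already present, which is
>             // exactly what an upload with ifGenerationMatch=0 is not allowed to replace.
>             System.out.println("generation=" + blob.getGeneration()
>                     + ", size=" + blob.getSize());
>         }
>     }
> }
> {code}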
>
> I even tried deleting the file in GCS and waiting for a new checkpoint to
> occur, and the same issue persisted.
>
> There was no issue when we restarted the job from a savepoint. The problem
> seems to affect only this one very specific file.
>
> I also tried the upload locally. It got a 503 from this endpoint with the same
> upload_id:
> {noformat}
> https://storage.googleapis.com/upload/storage/v1/<bucket>{noformat}
> However, it worked fine with this API (with a new upload_id):
> {noformat}
> https://storage.googleapis.com/<path>{noformat}
> I could not find the merged file on the Task Manager, so I was not able to
> retry the upload from the pod while it was failing.
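>
> To check whether the failing resumable session itself was still alive, GCS
> supports a status probe against the session URL: an empty PUT with
> Content-Range set to "bytes */*". A minimal sketch, assuming Java 11's
> HttpClient and placeholder bucket/upload_id values:
> {code:java}
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
>
> public class CheckResumableSession {
>     public static void main(String[] args) throws Exception {
>         // Placeholders: use the real bucket and the upload_id from the log.
>         // Per the GCS resumable-upload docs, the session URI itself acts as the
>         // credential, so no extra Authorization header is required.
>         String bucket = "my-bucket";
>         String uploadId = "UPLOAD_ID_FROM_LOG";
>         String sessionUrl = "https://storage.googleapis.com/upload/storage/v1/b/" + bucket
>                 + "/o?uploadType=resumable&upload_id=" + uploadId;
>
>         HttpRequest request = HttpRequest.newBuilder(URI.create(sessionUrl))
>                 .header("Content-Range", "bytes */*") // status query only, no payload
>                 .PUT(HttpRequest.BodyPublishers.noBody())
>                 .build();
>
>         HttpResponse<String> response = HttpClient.newHttpClient()
>                 .send(request, HttpResponse.BodyHandlers.ofString());
>
>         // 308 = session open but incomplete (Range header shows persisted bytes),
>         // 200/201 = the upload was already finalized, 5xx = the service side is failing.
>         System.out.println("status=" + response.statusCode());
>         response.headers().firstValue("Range")
>                 .ifPresent(range -> System.out.println("Range: " + range));
>     }
> }
> {code}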
--
This message was sent by Atlassian Jira
(v8.20.10#820010)