[ 
https://issues.apache.org/jira/browse/FLINK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanfei Lei updated FLINK-36884:
-------------------------------
    Component/s: Runtime / REST

> GCS 503 Error Codes for `flink-checkpoints/<id>/shared/` file after upload 
> complete
> -----------------------------------------------------------------------------------
>
>                 Key: FLINK-36884
>                 URL: https://issues.apache.org/jira/browse/FLINK-36884
>             Project: Flink
>          Issue Type: Bug
>          Components: FileSystems, Runtime / Checkpointing, Runtime / REST
>    Affects Versions: 1.18.0
>         Environment: We are using Flink 1.18.0 with the gs-plugin.
> It is a rare bug, but one we have noticed multiple times.
>            Reporter: Ryan van Huuksloot
>            Priority: Minor
>         Attachments: Screenshot 2024-12-10 at 1.46.06 PM.png
>
>
> We had a Flink pipeline that suddenly started to fail on a single subtask 
> [Image 1]. It does not block checkpointing for the rest of the DAG, so the 
> checkpoint barriers continue on.
> We investigated the issue and found that the checkpoint was retrying the write 
> of a single file over and over, thousands of times. The issue persisted across 
> checkpoints and savepoints, but only ever failed for that one specific file.
> An example log:
>  
> {code:java}
> Dec 10, 2024 6:06:05 PM com.google.cloud.hadoop.util.RetryHttpInitializer$LoggingResponseHandler handleResponse
> INFO: Encountered status code 503 when sending PUT request to URL 'https://storage.googleapis.com/upload/storage/v1/b/<bucket>/o?ifGenerationMatch=0&name=flink-checkpoints/2394318276860454f7b6d1689f770796/shared/7d6bb60b-e0cf-4873-afc1-f2d785a4418e&uploadType=resumable&upload_id=<upload_id>'. Delegating to response handler for possible retry.
> ...{code}
>  
> {*}It is important to note that the file was in fact there. I am not sure 
> whether it was complete, but it was not an .inprogress file, so I believe it 
> was complete{*}.
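> For reference: the failing request in the log carries {{ifGenerationMatch=0}}, i.e. a create-only precondition, so the upload can only succeed if no live generation of the object exists yet. Below is a minimal sketch, assuming the plain google-cloud-storage Java client (not the Hadoop connector) and the placeholder bucket name from the log, for checking whether the shared-state object is already present and with which generation:
> {code:java}
> import com.google.cloud.storage.Blob;
> import com.google.cloud.storage.BlobId;
> import com.google.cloud.storage.Storage;
> import com.google.cloud.storage.StorageOptions;
> 
> public class CheckSharedStateObject {
>     public static void main(String[] args) {
>         // Placeholder names -- substitute the real bucket and the object path from the failing PUT.
>         String bucket = "<bucket>";
>         String object = "flink-checkpoints/2394318276860454f7b6d1689f770796/shared/7d6bb60b-e0cf-4873-afc1-f2d785a4418e";
> 
>         Storage storage = StorageOptions.getDefaultInstance().getService();
>         Blob blob = storage.get(BlobId.of(bucket, object));
> 
>         if (blob == null) {
>             System.out.println("No live generation -- an ifGenerationMatch=0 upload should succeed.");
>         } else {
>             // A live generation already exists, so a create-only (generation 0) upload cannot succeed.
>             System.out.println("Object exists: generation=" + blob.getGeneration()
>                     + ", size=" + blob.getSize() + " bytes");
>         }
>     }
> }
> {code}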
>  
> I even tried deleting the file in GCS and waiting for a new checkpoint, but 
> the same issue persisted.
>  
> There was no issue when we restarted the job from a savepoint. The problem 
> seems to affect only this one very specific file.
>  
> I also tried it locally. It returned a 503 from this endpoint with the same 
> upload_id:
> {noformat}
> https://storage.googleapis.com/upload/storage/v1/<bucket>{noformat}
> However, it worked fine with this API (with a new upload_id):
> {noformat}
> https://storage.googleapis.com/<path>{noformat}
> I could not find the merged file on the TaskManager to retry from the pod while 
> it was failing.
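> For completeness, here is a minimal sketch of the equivalent create-if-absent resumable write, again assuming the plain google-cloud-storage Java client rather than the Hadoop connector and placeholder bucket/object names. {{Storage.BlobWriteOption.doesNotExist()}} corresponds to the {{ifGenerationMatch=0}} precondition seen in the log, so this may help reproduce the behaviour outside Flink:
> {code:java}
> import com.google.cloud.WriteChannel;
> import com.google.cloud.storage.BlobId;
> import com.google.cloud.storage.BlobInfo;
> import com.google.cloud.storage.Storage;
> import com.google.cloud.storage.StorageOptions;
> 
> import java.nio.ByteBuffer;
> import java.nio.charset.StandardCharsets;
> 
> public class ReproduceCreateIfAbsentUpload {
>     public static void main(String[] args) throws Exception {
>         // Placeholder names -- substitute the real bucket and shared-state path.
>         String bucket = "<bucket>";
>         String object = "flink-checkpoints/<id>/shared/<state-file>";
> 
>         Storage storage = StorageOptions.getDefaultInstance().getService();
>         BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of(bucket, object)).build();
> 
>         // Resumable upload with a create-only precondition (ifGenerationMatch=0),
>         // i.e. the same precondition the gs connector's PUT in the log carries.
>         try (WriteChannel writer =
>                 storage.writer(blobInfo, Storage.BlobWriteOption.doesNotExist())) {
>             writer.write(ByteBuffer.wrap("test-payload".getBytes(StandardCharsets.UTF_8)));
>         }
>         System.out.println("Upload finished with a fresh upload session.");
>     }
> }
> {code}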



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
