[jira] [Comment Edited] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned checkpoints

Tan Kim (Jira) Fri, 05 May 2023 19:25:05 -0700


    [ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720049#comment-17720049 ]


Tan Kim edited comment on FLINK-31963 at 5/6/23 2:24 AM:
---------------------------------------------------------

Q1

Once the exception occurs during recovery from an unaligned checkpoint, every 
later recovery from that same checkpoint fails in the same way.

Q2

If the numRecordsOut metric is reported for every operator, including chained 
ones, then I may have jumped the gun.
Since scaling usually takes time, and the source and downstream scaling 
conditions are different, I should have monitored it more closely.

Can you explain the difference between Vertex and Operator?
Scaling is done on a per-Vertex basis in the JobGraph, but if chaining is 
applied, are multiple Operators considered as one Vertex and therefore not 
subject to Source & Downstream scaling?
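
To make the question concrete, here is a toy Python model of how I currently 
understand it. This is not Flink code: the Operator class, the 
chainable_to_previous flag and the operator names are made up for illustration, 
and the real chaining conditions are more involved. The assumption is that 
consecutive operators connected by a chainable edge with the same parallelism 
are fused into one Vertex, so a scaling decision would apply to the whole 
Vertex rather than to each Operator.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    parallelism: int
    # Hypothetical flag: True when the edge from the previous operator
    # could be chained (simplified stand-in for the real conditions).
    chainable_to_previous: bool


def build_vertices(operators):
    """Group consecutive chainable operators into vertices.

    Each vertex is a list of operators that share one parallelism.
    """
    vertices = []
    for op in operators:
        can_chain = (
            bool(vertices)
            and op.chainable_to_previous
            and vertices[-1][-1].parallelism == op.parallelism
        )
        if can_chain:
            vertices[-1].append(op)  # fuse into the current vertex
        else:
            vertices.append([op])  # start a new vertex
    return vertices


pipeline = [
    Operator("source", 4, False),
    Operator("map", 4, True),
    Operator("keyed-window", 2, False),  # keyBy breaks the chain
    Operator("sink", 2, True),
]

vertices = build_vertices(pipeline)
names = [" -> ".join(op.name for op in v) for v in vertices]
print(names)
```

In this model the four Operators end up in two Vertices, and "map" could not 
be scaled independently of "source". Whether that matches what the autoscaler 
actually does is exactly what I am asking.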

Thanks for answering my question.


was (Author: JIRAUSER300108):
Q1

If it happens once during recovery from an unaligned checkpoint, it will always 
happen from the same checkpoint.

Q2

If the numRecordsOut metric applies to all operators, including chaining, then 
I may have jumped the gun.
Since scaling usually takes time and source & downstream scaling conditions are 
different, I guess I should have monitored it more closely.
Thanks for answering my question.

> java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned 
> checkpoints
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-31963
>                 URL: https://issues.apache.org/jira/browse/FLINK-31963
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.17.0
>         Environment: Flink: 1.17.0
> FKO: 1.4.0
> StateBackend: RocksDB (Generic Incremental Checkpoint & Unaligned Checkpoint 
> enabled)
>            Reporter: Tan Kim
>            Priority: Critical
>              Labels: stability
>         Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, 
> taskmanager_error.txt
>
>
> I'm testing Autoscaler through Kubernetes Operator and I'm facing the 
> following issue.
> As you know, when a job is scaled down through the autoscaler, the job 
> manager and task manager go down and then back up again.
> When this happens, an index out of bounds exception is thrown and the state 
> is not restored from a checkpoint.
> [~gyfora] told me via the Flink Slack troubleshooting channel that this is 
> likely an issue with Unaligned Checkpoint and not an issue with the 
> autoscaler, but I'm opening a ticket with Gyula for more clarification.
> Please see the attached JM and TM error logs.
> Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
