[
https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720610#comment-17720610
]
Piotr Nowojski edited comment on FLINK-31963 at 5/9/23 3:23 PM:
----------------------------------------------------------------
So far I have not been able to reproduce this :(
In addition to what [~srichter] asked: [~tanee.kim], would it be possible for
you to provide the checkpoint files from when the failure was happening, so
that we could reproduce it more easily?
Secondly, a random guess: can someone verify whether setting
{{execution.checkpointing.unaligned.max-subtasks-per-channel-state-file}} to 1
stops this issue from recurring? [1]
[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#execution-checkpointing-unaligned-max-subtasks-per-channel-state
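For anyone who wants to try this, a minimal sketch of the setting, assuming
the job is configured through {{flink-conf.yaml}} or through the operator's
{{spec.flinkConfiguration}} map. The surrounding layout is an assumption about
your deployment; only the option name comes from the suggestion above. This is
a diagnostic experiment, not a confirmed fix.

```yaml
# flink-conf.yaml (Flink 1.17): with a value of 1, each subtask writes its
# unaligned-checkpoint channel state to its own file instead of sharing one.
execution.checkpointing.unaligned.max-subtasks-per-channel-state-file: 1

# Assumed equivalent when deploying through the Flink Kubernetes Operator,
# placed in the FlinkDeployment resource (values in this map are strings):
#   spec:
#     flinkConfiguration:
#       execution.checkpointing.unaligned.max-subtasks-per-channel-state-file: "1"
```

After changing the value, take a fresh checkpoint with the new setting before
triggering the scale-down, since checkpoints written earlier still use the
old file layout.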
was (Author: pnowojski):
Additionally to what [~srichter] asked. [~tanee.kim], would it be possible for
you to provide the checkpoint files from when the failure was happening, so
that we could reproduce it more easily? So far I was not able to reproduce this
:(
> java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned
> checkpoints
> -------------------------------------------------------------------------------------
>
> Key: FLINK-31963
> URL: https://issues.apache.org/jira/browse/FLINK-31963
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.17.0
> Environment: Flink: 1.17.0
> FKO: 1.4.0
> StateBackend: RocksDB (Generic Incremental Checkpoint & Unaligned Checkpoint
> enabled)
> Reporter: Tan Kim
> Priority: Critical
> Labels: stability
> Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt,
> taskmanager_error.txt
>
>
> I'm testing the Autoscaler through the Kubernetes Operator and I'm facing the
> following issue.
> As you know, when a job is scaled down through the autoscaler, the job
> manager and task manager go down and then back up again.
> When this happens, an index out of bounds exception is thrown and the state
> is not restored from a checkpoint.
> [~gyfora] told me via the Flink Slack troubleshooting channel that this is
> likely an issue with Unaligned Checkpoint and not an issue with the
> autoscaler, but I'm opening a ticket with Gyula for more clarification.
> Please see the attached JM and TM error logs.
> Thank you.