[ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718610#comment-17718610 ]

Piotr Nowojski commented on FLINK-31963:
----------------------------------------

Thanks for the answers [~tanee.kim]

{quote}
It happens a couple of times, but not always.
{quote}
But once it has happened during a recovery from an unaligned checkpoint, does it 
then happen every time for that same checkpoint? Or is even that 
non-deterministic, so that retrying recovery from the same checkpoint can succeed?
{quote}
A question unrelated to this ticket: why do all the subtasks in the above 
jobgraph appear as a single one?
For source scaling to work, the outputRecords value needs to be non-zero, but 
because the operators downstream of the Kafka source are not separated in the 
jobgraph, outputRecords comes out as zero. We therefore explicitly added a keyBy 
operator after the Kafka source so that the streams are intentionally separated 
and the outputRecords value can be calculated.
(I don't think this is very good for performance.) Is there another way to make 
the stream split into two at the desired location in the jobgraph?
{quote}
You can break chains via {{startNewChain}} or {{disableChaining}}: 
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
 . However, this doesn't seem like the right thing to do. What do you mean by 
{{outputRecords}}? The {{numRecordsOut}} metric should be available for all 
operators, including chained source operators.
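
For illustration, a minimal sketch of what breaking the chain after the source 
could look like; the broker address, topic name and map function below are 
placeholders I'm assuming, not taken from your actual job:

{code:java}
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainBreakExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder Kafka source; broker and topic names are made up for this sketch.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("input-topic")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .map(value -> value.toUpperCase())
                .returns(Types.STRING)
                // startNewChain() starts a new operator chain here, so the map becomes
                // a separate task in the job graph; records are still forwarded locally,
                // without the network shuffle that an artificial keyBy introduces.
                .startNewChain()
                // Alternatively, disableChaining() would isolate this operator from
                // both its upstream and downstream neighbours.
                .print();

        env.execute("chain-break-example");
    }
}
{code}

Unlike a {{keyBy}} added only to split the graph, this does not repartition the 
data, and {{numRecordsOut}} should still be reported for the operators in each 
resulting task.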

> java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned 
> checkpoints
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-31963
>                 URL: https://issues.apache.org/jira/browse/FLINK-31963
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.17.0
>         Environment: Flink: 1.17.0
> FKO: 1.4.0
> StateBackend: RocksDB (Generic Incremental Checkpoint & Unaligned Checkpoint 
> enabled)
>            Reporter: Tan Kim
>            Priority: Critical
>              Labels: stability
>         Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt, 
> taskmanager_error.txt
>
>
> I'm testing Autoscaler through Kubernetes Operator and I'm facing the 
> following issue.
> As you know, when a job is scaled down through the autoscaler, the job 
> manager and task manager go down and then back up again.
> When this happens, an index out of bounds exception is thrown and the state 
> is not restored from a checkpoint.
> [~gyfora] told me via the Flink Slack troubleshooting channel that this is 
> likely an issue with Unaligned Checkpoint and not an issue with the 
> autoscaler, but I'm opening a ticket with Gyula for more clarification.
> Please see the attached JM and TM error logs.
> Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
