[
https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718610#comment-17718610
]
Piotr Nowojski commented on FLINK-31963:
----------------------------------------
Thanks for the answers [~tanee.kim]
{quote}
It happens a couple of times, but not always.
{quote}
But once it has happened during a recovery from an unaligned checkpoint, does
it always happen for that same checkpoint? Or is even that nondeterministic,
so that retrying recovery from the same checkpoint can succeed?
{quote}
A question unrelated to this ticket, but if the subtasks that exist in the
above jobgraph all appear to be one, why is that?
In order to do source scaling, the outputRecords value needs to be non-zero,
but since the downstream after the kafka source stream is not separated on the
jobgraph, the outputRecords is getting zero, so we explicitly added a keyBy
operator to the kafka source stream so that we can intentionally separate them
and then calculate the outputRecords value.
(I don't think this is very good for performance) Is there any other way to
ensure that the streams are separated into two at the desired location in the
jobgraph?
{quote}
You can break chains via {{startNewChain}} or {{disableChaining}}:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
However, this doesn't seem like the right thing to do. What do you mean by
{{outputRecords}}? The {{numRecordsOut}} metric should be available for all
operators, including chained source operators.
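For reference, a minimal sketch of what breaking a chain with those two calls might look like in the DataStream API. The class name {{ChainBreakSketch}}, the method {{buildPipeline}}, and the {{fromElements}} source (standing in for the Kafka source) are placeholders for illustration only, not taken from the job in this ticket.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainBreakSketch {

    // Builds source -> map -> filter and returns the last stream.
    // The in-memory source is a hypothetical stand-in for the Kafka source.
    static DataStream<String> buildPipeline(StreamExecutionEnvironment env) {
        DataStream<String> source = env.fromElements("a", "b", "c");

        // startNewChain(): the map is not chained to the source before it,
        // but operators after it may still be chained to the map.
        SingleOutputStreamOperator<String> upper = source
                .map(value -> value.toUpperCase())
                .startNewChain();

        // disableChaining(): the filter is chained to neither its
        // predecessor nor its successor.
        return upper
                .filter(value -> !value.isEmpty())
                .disableChaining();
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        buildPipeline(env).print();
        env.execute("chain-break-sketch");
    }
}
```

Unlike an extra {{keyBy}}, neither call adds a hash partitioning step, although records still have to be handed over between the separated tasks.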
> java.lang.ArrayIndexOutOfBoundsException when scaling down with unaligned
> checkpoints
> -------------------------------------------------------------------------------------
>
> Key: FLINK-31963
> URL: https://issues.apache.org/jira/browse/FLINK-31963
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.17.0
> Environment: Flink: 1.17.0
> FKO: 1.4.0
> StateBackend: RocksDB(Generic Incremental Checkpoint & Unaligned Checkpoint
> enabled)
> Reporter: Tan Kim
> Priority: Critical
> Labels: stability
> Attachments: image-2023-04-29-02-49-05-607.png, jobmanager_error.txt,
> taskmanager_error.txt
>
>
> I'm testing Autoscaler through Kubernetes Operator and I'm facing the
> following issue.
> As you know, when a job is scaled down through the autoscaler, the job
> manager and task manager go down and then back up again.
> When this happens, an index out of bounds exception is thrown and the state
> is not restored from a checkpoint.
> [~gyfora] told me via the Flink Slack troubleshooting channel that this is
> likely an issue with Unaligned Checkpoint and not an issue with the
> autoscaler, but I'm opening a ticket with Gyula for more clarification.
> Please see the attached JM and TM error logs.
> Thank you.