je-ik commented on issue #31313: URL: https://github.com/apache/beam/issues/31313#issuecomment-2130993220
> I'm surprised this is a bug considering restoring from a Flink savepoint is a pretty common use case; is it possible there's some configuration missing somewhere? I haven't been able to find anyone else online experiencing this same issue, but I was able to replicate it using both Kinesis and Kafka. Given how common of a use case it is, I'm not 100% sure I believe this is in fact a bug and most likely some user error on my part.

Can you please provide a minimal example and setup to reproduce the behavior?

> I can make do without savepoints by utilizing Kafka offset commits and consumer groups to ensure no data is lost, but I can't figure out a way to not lose data that is windowed but not yet triggered when the Flink application is stopped. Maybe you know of a solution to that problem?

You can drain the pipeline, see https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/cli/#terminating-a-job.

> It seems like a lot of the subtasks aren't being utilized when stripping IDs with beam_fn_api, despite the number of shards being 20 and the parallelism being 24 (in theory there should only be 4 idle subtasks).

This is related to how Flink computes target splits. It is affected by the maximum parallelism (which is computed automatically if not specified). You can try increasing it via `--setMaxParallelism=32768` (32768 is the maximum value); this could make the assignment more balanced.
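For reference, a minimal sketch of setting this option from the Java SDK, assuming the Flink runner's `FlinkPipelineOptions` (from the `beam-runners-flink` artifact) is on the classpath; the class name and the placeholder transforms are illustrative, and the exact command-line flag spelling can differ between SDKs:

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MaxParallelismExample {
  public static void main(String[] args) {
    // Parse any flags passed on the command line, then override programmatically.
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);

    options.setRunner(FlinkRunner.class);
    // Raise Flink's max parallelism so the source splits / shards are spread
    // more evenly across the 24 subtasks instead of clustering on a few of them.
    options.setMaxParallelism(32768);
    options.setParallelism(24);

    Pipeline pipeline = Pipeline.create(options);
    // ... add the Kafka/Kinesis read, windowing, and downstream transforms here ...
    pipeline.run();
  }
}
```

Note that Flink also uses the max parallelism to size the key groups for partitioned state, so it is typically set once up front rather than changed after the job has accumulated state.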
