[
https://issues.apache.org/jira/browse/BEAM-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921443#comment-16921443
]
Alexey Romanenko commented on BEAM-8121:
----------------------------------------
Thank you for adding the files, I don't see something suspicious in your
pipeline at the first sight. Maybe just try to make as simple as possible and
see if you still have the same effects.
In general, I'd not say it's an issue caused by KafkaIO since this is not its
responsibility to distribute data among workers depending on workload.
I can't say for sure why your CPU load has been increased so much with
Reshuffle, it could be a question for Google Dataflow engineers. Under the
hood, Reshuffle still uses GroupByKey and this is a recommended way to break a
fusion in [Dataflow
documentation|https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization].
Also, you can try to use "combined" approach using SideInput and Reshuffle, as
it's done in
[JdbcIO.Reparallelize|https://github.com/apache/beam/blob/59632eead48ecab37c698ce0e11e40fdfc5a603f/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1304]
> Messages are not distributed per machines when consuming from Kafka topic
> with 1 partition
> ------------------------------------------------------------------------------------------
>
> Key: BEAM-8121
> URL: https://issues.apache.org/jira/browse/BEAM-8121
> Project: Beam
> Issue Type: Bug
> Components: io-java-kafka
> Affects Versions: 2.14.0
> Reporter: TJ
> Priority: Major
> Attachments: datalake-dataflow-cleaned.zip
>
>
> Messages are consumed from Kafka using KafkaIO. Each kafka topic contains
> only 1 partition. (That means that messages can be consumed only by one
> Consumer per 1 consumer group)
> When backlog of topic grows and system scales from 1 to X machines, all the
> messages seems to be executed onĀ the same machine on which they are read.
> Due to that message throughput doesn't increase comparing X machines to 1
> machine. If one machine was reading 2K messagesĀ per s, X machines will be
> reading the same amount.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)