[ 
https://issues.apache.org/jira/browse/BEAM-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921443#comment-16921443
 ] 

Alexey Romanenko commented on BEAM-8121:
----------------------------------------

Thank you for adding the files, I don't see something suspicious in your 
pipeline at the first sight. Maybe just try to make as simple as possible and 
see if you still have the same effects.

In general, I'd not say it's an issue caused by KafkaIO since this is not its 
responsibility to distribute data among workers depending on workload.

I can't say for sure why your CPU load has been increased so much with 
Reshuffle, it could be a question for Google Dataflow engineers. Under the 
hood, Reshuffle still uses GroupByKey and this is a recommended way to break a 
fusion in [Dataflow 
documentation|https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization].
 Also, you can try to use "combined" approach using SideInput and Reshuffle, as 
it's done in 
[JdbcIO.Reparallelize|https://github.com/apache/beam/blob/59632eead48ecab37c698ce0e11e40fdfc5a603f/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1304]

> Messages are not distributed per machines when consuming from Kafka topic 
> with 1 partition
> ------------------------------------------------------------------------------------------
>
>                 Key: BEAM-8121
>                 URL: https://issues.apache.org/jira/browse/BEAM-8121
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-kafka
>    Affects Versions: 2.14.0
>            Reporter: TJ
>            Priority: Major
>         Attachments: datalake-dataflow-cleaned.zip
>
>
> Messages are consumed from Kafka using KafkaIO. Each kafka topic contains 
> only 1 partition. (That means that messages can be consumed only by one 
> Consumer per 1 consumer group)
> When backlog of topic grows and system scales from 1 to X machines, all the 
> messages seems to be executed onĀ  the same machine on which they are read. 
> Due to that message throughput doesn't increase comparing X machines to 1 
> machine. If one machine was reading 2K messagesĀ  per s, X machines will be 
> reading the same amount.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to