[ 
https://issues.apache.org/jira/browse/BEAM-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921555#comment-16921555
 ] 

Raghu Angadi commented on BEAM-8121:
------------------------------------

Thanks for filing this [~TauJan]. 
{quote}With single machine it was consuming less messages. 1 machine without 
reshuffle step ±1500message/s, 1 machine with reshuffle step ±450message/s. 
When it scaled to 5 it was consuming the same 1500message/s as without the step.
{quote}
The above indicates 1500/s is likely the bottleneck in your BigQuery sink. I 
think there are metrics Dataflow that points to the stage that is the 
bottleneck. Dataflow support can also confirm this from the metrics. If the 
downstream (in this case BigQueryIO stage), is the bottleneck, you would see 
that that stage's backlog to be high (You already mentioned 2-3 minute lag on 
reshuffle step).

One way to confirm this is to get stacktraces of your worker  using jstack 
(either with or without Reshuffle). You would likely notice that 'ConsumerLoop' 
thread in Kafka is not waiting on read from Kafka, but is waiting to enqueue 
the messages it has already read to the processing thread. This indicates the 
bottleneck is downstream processing.  I think we added some stats to KakfaIO to 
show this (Dataflow main job page shows some of these metrics). If you are able 
get stacktraces, please collect a few of them a few seconds apart.
{quote}Standalone kafka client was consuming 5000 message/s 
Dataflow pipeline with few empty tasks 3200message/s
{quote}
If we want to dig deeper into KafkaIO performance itself, this is a much better 
comparison, where there is minimal application processing past reading the 
messages. Please post the code for the standalone Kafka client app and simple 
Dataflow app. KafkaIO in Dataflow also uses standard Kafka client for IO. There 
are multiple threads involved in order to minimize latencies, and I expect 
there to be a bit of overhead of Dataflow Runner and KafkaIO, but it should be 
comparable. May be around 10%, but your seeing more than that. 

> Messages are not distributed per machines when consuming from Kafka topic 
> with 1 partition
> ------------------------------------------------------------------------------------------
>
>                 Key: BEAM-8121
>                 URL: https://issues.apache.org/jira/browse/BEAM-8121
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-kafka
>    Affects Versions: 2.14.0
>            Reporter: TJ
>            Priority: Major
>         Attachments: datalake-dataflow-cleaned.zip
>
>
> Messages are consumed from Kafka using KafkaIO. Each kafka topic contains 
> only 1 partition. (That means that messages can be consumed only by one 
> Consumer per 1 consumer group)
> When backlog of topic grows and system scales from 1 to X machines, all the 
> messages seems to be executed on  the same machine on which they are read. 
> Due to that message throughput doesn't increase comparing X machines to 1 
> machine. If one machine was reading 2K messages  per s, X machines will be 
> reading the same amount.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to