Hi all,
recently I've been struggling to adapt Java's KafkaIOIT to work with a
large dataset generated by a SyntheticSource. I want to push 100M records
through a Kafka topic, verify data correctness, and at the same time
check the performance of KafkaIO.Write and KafkaIO.Read.
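
For reference, the synthetic dataset is described with a JSON snippet roughly
like the one below (just a sketch - the key/value sizes here are placeholders,
the real values come from the test's pipeline options):

import java.io.IOException;
import org.apache.beam.sdk.io.synthetic.SyntheticOptions;
import org.apache.beam.sdk.io.synthetic.SyntheticSourceOptions;

class SourceConfigSketch {
  // 100M records; the 10B/90B key/value sizes are placeholders, not the
  // values actually used by the test.
  static SyntheticSourceOptions hundredMillionRecords() throws IOException {
    String json =
        "{\"numRecords\": 100000000, \"keySizeBytes\": 10, \"valueSizeBytes\": 90}";
    return SyntheticOptions.fromJsonString(json, SyntheticSourceOptions.class);
  }
}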

To run the tests, I'm using the Kafka cluster on Kubernetes from the Beam
repo (here:
<https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster>
).

To give you an overview of what should happen: first, the records are
generated deterministically (using hashes of list positions as Random
seeds) and written to Kafka - this concludes the write pipeline.
As for reading and correctness checking: the data is read from the topic,
decoded into String representations, and a hashcode of the whole
PCollection is calculated (for details, see KafkaIOIT.java).
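
In outline, the two pipelines look roughly like this (a sketch rather than the
exact code in KafkaIOIT.java - the broker address, topic name and the
byte[]-to-String decoding are placeholders, and HashingFn is the hashing
combiner from the Beam IO test utilities):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.io.common.HashingFn;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.synthetic.SyntheticBoundedSource;
import org.apache.beam.sdk.io.synthetic.SyntheticSourceOptions;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

class KafkaIOITSketch {

  // Write pipeline: synthetic records -> Kafka topic.
  static void runWrite(Pipeline writePipeline, SyntheticSourceOptions sourceOptions,
      String bootstrapServers, String topic) {
    writePipeline
        .apply("Generate records", Read.from(new SyntheticBoundedSource(sourceOptions)))
        .apply("Write to Kafka",
            KafkaIO.<byte[], byte[]>write()
                .withBootstrapServers(bootstrapServers)
                .withTopic(topic)
                .withKeySerializer(ByteArraySerializer.class)
                .withValueSerializer(ByteArraySerializer.class));
    writePipeline.run().waitUntilFinish();
  }

  // Read pipeline: Kafka topic -> String representations -> single hashcode.
  static void runRead(Pipeline readPipeline, long expectedRecords,
      String bootstrapServers, String topic) {
    PCollection<String> hashcode = readPipeline
        .apply("Read from Kafka",
            KafkaIO.<byte[], byte[]>read()
                .withBootstrapServers(bootstrapServers)
                .withTopic(topic)
                .withKeyDeserializer(ByteArrayDeserializer.class)
                .withValueDeserializer(ByteArrayDeserializer.class)
                // bound the otherwise unbounded read so the pipeline can finish
                .withMaxNumRecords(expectedRecords)
                .withoutMetadata())
        .apply("Decode to String",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<byte[], byte[]> kv) -> Arrays.toString(kv.getValue())))
        // hashcode of the whole PCollection
        .apply("Calculate hashcode", Combine.globally(new HashingFn()).withoutDefaults());
    // the hashcode PCollection is then asserted against the expected value
    readPipeline.run().waitUntilFinish();
  }
}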

While testing I ran into two problems:
1. Even when all the records are read from the Kafka topic, the calculated
hash is different on each run.
2. Sometimes not all the records are read and the Dataflow job waits for
input indefinitely, occasionally throwing exceptions.

I suspect that something is wrong with the Kafka cluster configuration,
but unfortunately I don't have enough experience with Kafka to know what
might be missing here.
I would be very grateful for any help - Kafka configuration hints or
anything else you can suggest.

Thanks and have a good day,

Michal

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer
