Hi all,

Recently I've been struggling to adapt the Java KafkaIOIT to work with a large dataset generated by a SyntheticSource. I want to push 100M records through a Kafka topic, verify data correctness and, at the same time, check the performance of KafkaIO.Write and KafkaIO.Read.
To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam repo (here: <https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster>).

To give you an overview of what should happen: first, the records are generated deterministically (using hashes of list positions as Random seeds) and then written to Kafka - this concludes the write pipeline. As for reading and correctness checking: the data is read from the topic, decoded into String representations, and a hashcode of the whole PCollection is calculated (for details, check KafkaIOIT.java). A simplified sketch of both pipelines is at the end of this mail.

During the testing I ran into several problems:
1. When all the records are read from the Kafka topic, the hash is different each time.
2. Sometimes not all the records are read, and the Dataflow job waits for the input indefinitely, occasionally throwing exceptions.

I suspect that something is wrong with the Kafka cluster configuration, but unfortunately I don't have enough experience with this tool to know what could be missing here. I would be very grateful for your help - Kafka config hints or anything else you can add.

Thanks and have a good day,
Michal

--
Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer
M: +48 791 432 002
E: michal.wale...@polidea.com
Unique Tech
Check out our projects! <https://www.polidea.com/our-work>
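
P.S. For reference, here is a heavily simplified sketch of what the two pipelines do. This is not the actual KafkaIOIT code: the broker address, topic name and the deterministic-record helper are illustrative stand-ins, I use GenerateSequence with a seeded Random in place of the real SyntheticSource, and a plain sum of per-element hashcodes in place of the hashing combine the test uses.

import java.util.Random;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaHashSketch {

  // Illustrative values - the real test reads these from pipeline options.
  static final String BOOTSTRAP = "kafka-broker:9092";
  static final String TOPIC = "beam-io-it";
  static final long NUM_RECORDS = 100_000_000L;

  /** Deterministic record: the element index seeds a Random, so every run generates identical data. */
  static KV<String, String> deterministicRecord(long index) {
    Random random = new Random(Long.hashCode(index));
    return KV.of(Long.toString(index), Long.toString(random.nextLong()));
  }

  public static void main(String[] args) {
    // Write pipeline: generate NUM_RECORDS deterministic records and push them to the Kafka topic.
    Pipeline writePipeline = Pipeline.create();
    writePipeline
        .apply("Generate indices", GenerateSequence.from(0).to(NUM_RECORDS))
        .apply("Make records",
            MapElements.into(
                    TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via(KafkaHashSketch::deterministicRecord))
        .apply("Write to Kafka",
            KafkaIO.<String, String>write()
                .withBootstrapServers(BOOTSTRAP)
                .withTopic(TOPIC)
                .withKeySerializer(StringSerializer.class)
                .withValueSerializer(StringSerializer.class));
    writePipeline.run().waitUntilFinish();

    // Read pipeline: read the same number of records back, decode the values to Strings
    // and collapse them into a single order-insensitive checksum.
    Pipeline readPipeline = Pipeline.create();
    PCollection<String> values =
        readPipeline
            .apply("Read from Kafka",
                KafkaIO.<String, String>read()
                    .withBootstrapServers(BOOTSTRAP)
                    .withTopic(TOPIC)
                    .withKeyDeserializer(StringDeserializer.class)
                    .withValueDeserializer(StringDeserializer.class)
                    .withMaxNumRecords(NUM_RECORDS) // turn the unbounded source into a bounded one
                    .withoutMetadata())
            .apply("Extract values",
                MapElements.into(TypeDescriptors.strings())
                    .via((KV<String, String> kv) -> kv.getValue()));

    // Order-insensitive checksum of the whole PCollection; the result can then be compared
    // against the checksum computed over the generated data (e.g. with PAssert.thatSingleton).
    values
        .apply("Hash each record",
            MapElements.into(TypeDescriptors.longs()).via((String s) -> (long) s.hashCode()))
        .apply("Sum hashes", Sum.longsGlobally());
    readPipeline.run().waitUntilFinish();
  }
}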