Thanks for the reply, Cham. I'll check the counters and see if they are consistent. The read and write pipelines are already running separately.
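For anyone following along: the scheme described in the quoted thread below (deterministic records seeded by position hashes, then a single hashcode over the read-back data) can be sketched in plain Java. One thing worth noting is that a PCollection has no defined element order, so an order-sensitive hash would differ between runs even when the data is correct; a commutative combine of per-record hashes avoids that. This is only an illustrative sketch with made-up names, not the actual KafkaIOIT code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DeterministicHashSketch {

    // Generate record i from a Random seeded with the hash of its position,
    // so every run produces identical data.
    static String recordAt(long position) {
        Random rng = new Random(Long.hashCode(position));
        return "record-" + rng.nextLong();
    }

    // Order-insensitive digest: summing per-element hashes is commutative,
    // so the result is stable no matter how the records are reordered
    // between writing and reading.
    static long digest(List<String> records) {
        long acc = 0L;
        for (String r : records) {
            acc += r.hashCode();
        }
        return acc;
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        for (long i = 0; i < 1000; i++) {
            written.add(recordAt(i));
        }

        // Simulate reading the same records back in a different order.
        List<String> read = new ArrayList<>(written);
        Collections.shuffle(read, new Random(0));

        System.out.println(digest(written) == digest(read)); // prints "true"
    }
}
```

If the hash in the actual test is computed in an order-dependent way, that alone could explain a digest that differs on every run without any data loss or duplication.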
I created a question on Stack Overflow <https://stackoverflow.com/questions/57901708/kafka-cluster-loses-or-duplicates-messages>, feel free to answer there if you want your input to be easier to search and access in the future.

Michal

On Wed, Sep 4, 2019 at 7:04 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
> On Wed, Sep 4, 2019 at 4:23 AM Michał Walenia <michal.wale...@polidea.com> wrote:
>
>> Hi all,
>> Recently I've been struggling to adapt Java's KafkaIOIT to work with a
>> large dataset generated by a SyntheticSource. I want to push 100M records
>> through a Kafka topic and verify data correctness, and at the same time
>> check the performance of KafkaIO.Write and KafkaIO.Read.
>>
>> To perform the tests, I'm using a Kafka cluster on Kubernetes from the
>> Beam repo (here
>> <https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster>).
>>
>> To give you an overview of what should happen: first, the records are
>> generated deterministically (using hashes of list positions as Random
>> seeds), then they are written to Kafka - this concludes the write pipeline.
>> As for reading and correctness checking: first, the data is read from the
>> topic, and after being decoded into String representations, a hashcode of
>> the whole PCollection is calculated (for details, check KafkaIOIT.java).
>>
>> During the testing I ran into several problems:
>> 1. When all the records are read from the Kafka topic, the hash is
>> different each time.
>> 2. Sometimes not all the records are read, and the Dataflow job waits for
>> the input indefinitely, occasionally throwing exceptions.
>>
>> I suspect that something is wrong with the Kafka cluster configuration,
>> but unfortunately I lack the experience with this tool to know what could
>> be missing here.
>> I would be very grateful for your help - Kafka config hints or anything
>> else you can add.
>
> Do you see consistent values in Dataflow step element counters?
> That might help you to figure out where the non-determinism is. Also,
> trying to run the read and write jobs separately might be helpful. I'm not
> too familiar with Kafka cluster configurations, though.
>
> Thanks,
> Cham
>
>> Thanks and have a good day,
>>
>> Michal

--

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>