On Wed, Sep 4, 2019 at 4:23 AM Michał Walenia <michal.wale...@polidea.com> wrote:
> Hi all,
> recently I've been struggling to adapt Java's KafkaIOIT to work with a
> large dataset generated by a SyntheticSource. I want to push 100M records
> through a Kafka topic and verify data correctness and at the same time
> check the performance of KafkaIO.Write and KafkaIO.Read.
>
> To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam
> repo (here
> <https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster>
> ).
>
> To give you an overview of what should happen, first the records are
> generated in a deterministic way (using hashes of list positions as Random
> seeds), next they are written to Kafka - this concludes the write pipeline.
> As for reading and correctness checking - first, the data is read from the
> topic and after being decoded into String representations, a hashcode of
> the whole PCollection is calculated (For details, check KafkaIOIT.java).
>
> During the testing I ran into several problems:
> 1. When all the records are read from the Kafka topic, the hash is
> different each time.
> 2. Sometimes not all the records are read and the Dataflow task waits for
> the input indefinitely, occasionally throwing exceptions.
>
> I suspect that something is wrong with the Kafka cluster configuration,
> but unfortunately, I lack experience with this tool to know what could be
> missing here.
> I would be very grateful for your help, Kafka config hints or anything
> else you can add.

Do you see consistent values in Dataflow step element counters? That might
help you to figure out where the non-determinism is. Also, trying to run
read and write jobs separately might be helpful. I'm not too familiar with
Kafka cluster configurations though.

Thanks,
Cham

> Thanks and have a good day,
>
> Michal
>
> --
> Michał Walenia
> Polidea <https://www.polidea.com/> | Software Engineer
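
To make the correctness check discussed above concrete, here is a minimal standalone sketch in plain Java. It is not the KafkaIOIT code and does not use Beam or Kafka; the class name `SyntheticHashSketch` and all of its methods are hypothetical. It assumes records are derived from a Random seeded by a hash of their position, and that the combined hash is an order-independent sum of per-record digests. Under those assumptions, reading the records back in a different order should leave the hash unchanged, while a duplicated or missing record should change it, which is the kind of difference the element counters may help to spot.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; names and hashing scheme are assumptions.
final class SyntheticHashSketch {

  // Derive a record from its list position: the position's hash seeds Random,
  // so the same position always yields the same record.
  static String record(long position) {
    Random random = new Random(Long.hashCode(position));
    return position + ":" + Long.toHexString(random.nextLong());
  }

  static List<String> generate(int count) {
    List<String> records = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      records.add(record(i));
    }
    return records;
  }

  // Order-independent hash: sum the first 8 bytes of each record's SHA-256.
  // Wrapping long addition is commutative, so arrival order does not matter.
  static long unorderedHash(Iterable<String> records) {
    MessageDigest sha256;
    try {
      sha256 = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
    long sum = 0L;
    for (String r : records) {
      byte[] digest = sha256.digest(r.getBytes(StandardCharsets.UTF_8));
      sum += ByteBuffer.wrap(digest).getLong();
    }
    return sum;
  }

  // Number of records that appear more than once (total minus distinct).
  static int duplicateCount(List<String> records) {
    return records.size() - new HashSet<>(records).size();
  }

  public static void main(String[] args) {
    List<String> written = generate(1000);

    // Simulate reading the same records back in a different order.
    List<String> readBack = new ArrayList<>(written);
    Collections.shuffle(readBack, new Random(42));
    System.out.println("same hash after reordering: "
        + (unorderedHash(written) == unorderedHash(readBack)));

    // Simulate one record being delivered twice.
    readBack.add(readBack.get(0));
    System.out.println("same hash with one duplicate: "
        + (unorderedHash(written) == unorderedHash(readBack)));
    System.out.println("duplicates seen: " + duplicateCount(readBack));
  }
}
```

If a hash of this kind still differs between runs, comparing the total and distinct record counts on the read side is one way to tell redelivered records apart from missing ones.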