On Wed, Sep 4, 2019 at 4:23 AM Michał Walenia <michal.wale...@polidea.com>
wrote:

> Hi all,
> recently I've been struggling to adapt Java's KafkaIOIT to work with a
> large dataset generated by a SyntheticSource. I want to push 100M records
> through a Kafka topic and verify data correctness and at the same time
> check the performance of KafkaIO.Write and KafkaIO.Read.
>
> To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam
> repo (here
> <https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster>
> ).
>
> To give you an overview of what should happen, first the records are
> generated in a deterministic way (using hashes of list positions as Random
> seeds), next they are written to Kafka - this concludes the write pipeline.
> As for reading and correctness checking - first, the data is read from the
> topic and after being decoded into String representations, a hashcode of
> the whole PCollection is calculated (For details, check KafkaIOIT.java).
>
> During the testing I ran into several problems:
> 1. When all the records are read from the Kafka topic, the hash is
> different each time.
> 2. Sometimes not all the records are read and the Dataflow task waits for
> the input indefinitely, occasionally throwing exceptions.
>
> I suspect that something is wrong with the Kafka cluster configuration,
> but unfortunately, I lack experience with this tool to know what could be
> missing here.
> I would be very grateful for your help, Kafka config hints or anything
> else you can add.
>

Do you see consistent values in Dataflow step element counters ? That might
help you to figure out where the non-determinism is. Also, trying to run
read and write jobs separately might be helpful. I'm not too familiar with
Kafka cluster configurations though.

Thanks,
Cham


> Thanks and have a good day,
>
> Michal
>
> --
>
> Michał Walenia
> Polidea <https://www.polidea.com/> | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! <https://www.polidea.com/our-work>
>

Reply via email to