Thanks for the reply, Cham. I'll check the counters and see if they are
consistent. The read and write pipelines are already running separately.
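
For the counter check, in addition to the built-in step counters I'm thinking
of putting an explicit metric on the decode step, roughly like the sketch
below (the class and counter names are placeholders, not the actual KafkaIOIT
code), so the per-step element counts can be compared between runs in the
Dataflow UI:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Decodes Kafka records to Strings and counts them; the counter shows up in
// the Dataflow UI and can be checked against the expected 100M per run.
public class CountingDecodeFn extends DoFn<KV<String, String>, String> {
  private final Counter recordsRead =
      Metrics.counter(CountingDecodeFn.class, "records_read"); // placeholder name

  @ProcessElement
  public void processElement(@Element KV<String, String> record,
      OutputReceiver<String> out) {
    recordsRead.inc();
    out.output(record.getValue());
  }
}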

I created a question on Stack Overflow
<https://stackoverflow.com/questions/57901708/kafka-cluster-loses-or-duplicates-messages>;
feel free to answer there if you want your input to be easier to search and
access in the future.
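
For anyone finding this thread later, the write and read pipelines described
in the quoted message below come down to roughly this (a simplified sketch
rather than the actual KafkaIOIT code - the broker address, topic name, the
record generation step and the order-insensitive hashing combine are all
placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaPipelineSketch {
  static final String BOOTSTRAP = "kafka-broker:9092"; // placeholder address
  static final String TOPIC = "beam-kafkaioit";        // placeholder topic

  public static void main(String[] args) {
    // Write pipeline: deterministically generated records -> Kafka.
    // Create.of stands in for the SyntheticSource used in the real test.
    Pipeline writePipeline = Pipeline.create();
    writePipeline
        .apply("Generate records", Create.of(KV.of("key-0", "value-0")))
        .apply("Write to Kafka",
            KafkaIO.<String, String>write()
                .withBootstrapServers(BOOTSTRAP)
                .withTopic(TOPIC)
                .withKeySerializer(StringSerializer.class)
                .withValueSerializer(StringSerializer.class));
    writePipeline.run().waitUntilFinish();

    // Read pipeline: Kafka -> String values -> one hash for the whole
    // PCollection, compared afterwards against the expected value.
    Pipeline readPipeline = Pipeline.create();
    readPipeline
        .apply("Read from Kafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers(BOOTSTRAP)
                .withTopic(TOPIC)
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withMaxNumRecords(100_000_000L) // stop after the expected count
                .withoutMetadata())
        .apply("Decode to String",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, String> kv) -> kv.getValue()))
        .apply("Hash whole PCollection", Combine.globally(new XorHashFn()));
    readPipeline.run().waitUntilFinish();
  }

  // Order-insensitive combine so the result does not depend on the order in
  // which records arrive; a stand-in for the hashing used in KafkaIOIT.
  static class XorHashFn extends Combine.CombineFn<String, Long, Long> {
    @Override public Long createAccumulator() { return 0L; }
    @Override public Long addInput(Long acc, String input) { return acc ^ input.hashCode(); }
    @Override public Long mergeAccumulators(Iterable<Long> accumulators) {
      long merged = 0L;
      for (long acc : accumulators) { merged ^= acc; }
      return merged;
    }
    @Override public Long extractOutput(Long acc) { return acc; }
  }
}

One thing I still want to double-check on my side is that the hashing combine
really is order-insensitive, since a hash that depends on read order would
differ between runs even when the data itself is correct.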

Michal


On Wed, Sep 4, 2019 at 7:04 PM Chamikara Jayalath <chamik...@google.com>
wrote:

>
>
> On Wed, Sep 4, 2019 at 4:23 AM Michał Walenia <michal.wale...@polidea.com>
> wrote:
>
>> Hi all,
>> Recently I've been struggling to adapt Java's KafkaIOIT to work with a
>> large dataset generated by a SyntheticSource. I want to push 100M records
>> through a Kafka topic, verify data correctness and, at the same time,
>> check the performance of KafkaIO.Write and KafkaIO.Read.
>>
>> To perform the tests I'm using a Kafka cluster on Kubernetes from the
>> Beam repo (here
>> <https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster>
>> ).
>>
>> To give you an overview of what should happen: first, the records are
>> generated deterministically (using hashes of list positions as Random
>> seeds), then they are written to Kafka - this concludes the write
>> pipeline. As for reading and correctness checking: the data is read from
>> the topic, decoded into String representations, and then a hashcode of
>> the whole PCollection is calculated (for details, see KafkaIOIT.java).
>>
>> During testing I ran into two problems:
>> 1. When all the records are read from the Kafka topic, the hash is
>> different each time.
>> 2. Sometimes not all the records are read, and the Dataflow job waits for
>> input indefinitely, occasionally throwing exceptions.
>>
>> I suspect that something is wrong with the Kafka cluster configuration,
>> but unfortunately I lack the experience with this tool to know what might
>> be missing here.
>> I would be very grateful for your help - Kafka config hints or anything
>> else you can add.
>>
>
> Do you see consistent values in the Dataflow step element counters? That
> might help you figure out where the non-determinism is. Also, trying to
> run the read and write jobs separately might be helpful. I'm not too
> familiar with Kafka cluster configurations, though.
>
> Thanks,
> Cham
>
>
>> Thanks and have a good day,
>>
>> Michal
>>
>> --
>>
>> Michał Walenia
>> Polidea <https://www.polidea.com/> | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.wale...@polidea.com
>>
>> Unique Tech
>> Check out our projects! <https://www.polidea.com/our-work>
>>
>

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>
