Michal Walenia created BEAM-8207:
------------------------------------
Summary: KafkaIOITs generate different hashes each run, sometimes
dropping records
Key: BEAM-8207
URL: https://issues.apache.org/jira/browse/BEAM-8207
Project: Beam
Issue Type: Bug
Components: io-java-kafka, testing
Reporter: Michal Walenia
While adapting Java's KafkaIOIT to work with a large dataset generated by a
SyntheticSource, I encountered a problem. I want to push 100M records through a
Kafka topic, verify data correctness, and at the same time check the
performance of KafkaIO.Write and KafkaIO.Read.
To perform the tests, I'm using a Kafka cluster on Kubernetes deployed from the Beam repo
([here|https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster]).
The expected flow is as follows: first, the records are generated
deterministically (using hashes of list positions as Random seeds), then they
are written to Kafka - this concludes the write pipeline.
As for reading and correctness checking: the data is read from the topic,
decoded into String representations, and then a hashcode of the whole
PCollection is calculated (for details, see KafkaIOIT.java).
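A rough sketch of the read-and-verify side follows. The bootstrap server address and topic name are again placeholders, and the order-insensitive sum-of-hashcodes combiner below only stands in for the hashing combine that KafkaIOIT actually uses.
{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class KafkaReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Read the records back and decode the values into Strings.
    PCollection<String> decoded = pipeline
        .apply("Read from Kafka", KafkaIO.<byte[], byte[]>read()
            .withBootstrapServers("kafka-broker:9092") // placeholder address
            .withTopic("kafkaioit-topic")              // placeholder topic name
            .withKeyDeserializer(ByteArrayDeserializer.class)
            .withValueDeserializer(ByteArrayDeserializer.class)
            .withMaxNumRecords(100_000_000L))          // bound the otherwise unbounded read
        .apply("Decode values", MapElements.via(
            new SimpleFunction<KafkaRecord<byte[], byte[]>, String>() {
              @Override
              public String apply(KafkaRecord<byte[], byte[]> record) {
                return new String(record.getKV().getValue(), StandardCharsets.UTF_8);
              }
            }));

    // Reduce the whole PCollection to one hash. Summing hashcodes is insensitive
    // to read order, so deterministic input should yield the same value every run.
    decoded.apply("Hash collection", Combine.globally(
        new Combine.CombineFn<String, Long, Long>() {
          @Override public Long createAccumulator() { return 0L; }
          @Override public Long addInput(Long acc, String element) { return acc + element.hashCode(); }
          @Override public Long mergeAccumulators(Iterable<Long> accumulators) {
            long sum = 0L;
            for (Long a : accumulators) { sum += a; }
            return sum;
          }
          @Override public Long extractOutput(Long acc) { return acc; }
        }));

    pipeline.run().waitUntilFinish();
  }
}
{code}
Because the combiner is commutative and associative, the final value does not depend on how the runner shards or orders the records, so any run-to-run difference should point at the data itself.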
During testing I ran into several problems:
1. Even when all the records are read from the Kafka topic, the hash is
different on each run.
2. Sometimes not all the records are read, and the Dataflow job waits for
input indefinitely, occasionally throwing exceptions.
I believe there are two possible causes of this behavior:
either something is wrong with the Kafka cluster configuration,
or KafkaIO behaves erratically at high data volumes, duplicating and/or
dropping records.
The second option seems troubling, and I would be grateful for help with the first.
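As a hypothetical diagnostic (not part of KafkaIOIT), comparing the total element count with the distinct-key count after the read could help tell duplication apart from dropping:
{code:java}
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.PCollection;

public class ReadDiagnosticsSketch {
  /** keys: the Kafka record keys decoded to Strings, as read back from the topic. */
  public static void attachCounts(PCollection<String> keys) {
    // Total number of elements KafkaIO.Read delivered, duplicates included.
    keys.apply("Count all", Count.<String>globally());

    // Number of unique keys; with deterministic generation this should be exactly 100M.
    keys.apply("Deduplicate", Distinct.<String>create())
        .apply("Count distinct", Count.<String>globally());
  }
}
{code}
A total count above 100M with 100M distinct keys would point at duplication, while fewer distinct keys would point at dropped records.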