[
https://issues.apache.org/jira/browse/BEAM-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Romanenko resolved BEAM-8207.
------------------------------------
Fix Version/s: Not applicable
Resolution: Not A Problem
> KafkaIOITs generate different hashes each run, sometimes dropping records
> -------------------------------------------------------------------------
>
> Key: BEAM-8207
> URL: https://issues.apache.org/jira/browse/BEAM-8207
> Project: Beam
> Issue Type: Bug
> Components: io-java-kafka, testing
> Reporter: Michal Walenia
> Priority: Major
> Fix For: Not applicable
>
>
> While working to adapt Java's KafkaIOIT to work with a large dataset
> generated by a SyntheticSource I encountered a problem. I want to push 100M
> records through a Kafka topic, verify data correctness and at the same time
> check the performance of KafkaIO.Write and KafkaIO.Read.
>
> To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam
> repo
> ([here|https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster]).
>
> The expected result would be that first the records are generated in a
> deterministic way (using hashes of list positions as Random seeds), next they
> are written to Kafka - this concludes the write pipeline.
> As for reading and correctness checking - first, the data is read from the
> topic and after being decoded into String representations, a hashcode of the
> whole PCollection is calculated (For details, check KafkaIOIT.java).
>
> During the testing I ran into several problems:
> 1. When all the records are read from the Kafka topic, the hash is different
> each time.
> 2. Sometimes not all the records are read and the Dataflow task waits for the
> input indefinitely, occasionally throwing exceptions.
>
> I believe there are two possible causes of this behavior:
>
> either there is something wrong with the Kafka cluster configuration
> or KafkaIO behaves erratically on high data volumes, duplicating and/or
> dropping records.
> Second option seems troubling and I would be grateful for help with the first.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)