[ https://issues.apache.org/jira/browse/BEAM-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969327#comment-16969327 ]

Michal Walenia commented on BEAM-8207:
--------------------------------------

[~aromanenko] Yes, sorry for not updating the issue. It turned out that the 
problem appeared when the topic had multiple partitions and a replication 
factor greater than 1 - most likely the cluster has trouble keeping the 
replicas synchronised at this volume. Setting both the partition count and 
the replication factor to 1 solved the issue. In the future I may look into 
cluster parameters and settings that would allow for multiple partitions.

For testing purposes, this setup is sufficient.
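
For reference, a minimal sketch of recreating the topic with those settings 
via the Kafka AdminClient (the topic name and bootstrap address below are 
placeholders, not the actual test configuration):

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RecreateTestTopic {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Placeholder address - point this at the Kubernetes-hosted cluster.
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      // One partition, replication factor 1 - the configuration that made
      // the hashes stable in my runs.
      NewTopic topic = new NewTopic("beam-kafkaioit", 1, (short) 1);
      admin.createTopics(Collections.singleton(topic)).all().get();
    }
  }
}
{code}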

> KafkaIOITs generate different hashes each run, sometimes dropping records
> -------------------------------------------------------------------------
>
>                 Key: BEAM-8207
>                 URL: https://issues.apache.org/jira/browse/BEAM-8207
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-kafka, testing
>            Reporter: Michal Walenia
>            Priority: Major
>
> While adapting the Java KafkaIOIT to a large dataset generated by a 
> SyntheticSource, I encountered a problem. I want to push 100M records 
> through a Kafka topic, verify data correctness, and at the same time check 
> the performance of KafkaIO.Write and KafkaIO.Read.
>  
> To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam 
> repo 
> ([here|https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster]).
>  
> The expected flow is as follows: first, the records are generated in a 
> deterministic way (using hashes of list positions as Random seeds), then 
> they are written to Kafka - this concludes the write pipeline.
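> A condensed sketch of the write side (pipeline setup shown, imports 
> omitted; the broker address, topic name and record size are placeholders 
> rather than the actual test configuration):
> {code:java}
> Pipeline pipeline = Pipeline.create(options);
> pipeline
>     .apply("Positions", GenerateSequence.from(0).to(100_000_000L))
>     .apply("Deterministic records", MapElements.via(
>         new SimpleFunction<Long, KV<byte[], byte[]>>() {
>           @Override
>           public KV<byte[], byte[]> apply(Long position) {
>             // Seeding a Random with a hash of the position makes every
>             // run produce exactly the same record at the same position.
>             Random random = new Random(Long.hashCode(position));
>             byte[] value = new byte[100];  // placeholder record size
>             random.nextBytes(value);
>             byte[] key =
>                 Long.toString(position).getBytes(StandardCharsets.UTF_8);
>             return KV.of(key, value);
>           }
>         }))
>     .apply("Write", KafkaIO.<byte[], byte[]>write()
>         .withBootstrapServers("kafka:9092")  // placeholder address
>         .withTopic("beam-kafkaioit")         // placeholder topic
>         .withKeySerializer(ByteArraySerializer.class)
>         .withValueSerializer(ByteArraySerializer.class));
> pipeline.run().waitUntilFinish();
> {code}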
> As for reading and correctness checking: the data is read from the topic, 
> decoded into String representations, and a hashcode of the whole 
> PCollection is calculated (for details, see KafkaIOIT.java).
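> The read side, condensed in the same way (KafkaIOIT combines per-record 
> hashes with a combine fn; a simple order-independent sum of element hashes 
> stands in for it here):
> {code:java}
> PCollection<String> values = pipeline
>     .apply("Read", KafkaIO.<byte[], byte[]>read()
>         .withBootstrapServers("kafka:9092")  // placeholder address
>         .withTopic("beam-kafkaioit")         // placeholder topic
>         .withKeyDeserializer(ByteArrayDeserializer.class)
>         .withValueDeserializer(ByteArrayDeserializer.class)
>         .withMaxNumRecords(100_000_000L)     // bound the unbounded source
>         .withoutMetadata())
>     .apply("Decode", MapElements.via(
>         new SimpleFunction<KV<byte[], byte[]>, String>() {
>           @Override
>           public String apply(KV<byte[], byte[]> record) {
>             return new String(record.getValue(), StandardCharsets.UTF_8);
>           }
>         }));
> // Order-independent aggregate: summing per-element hashes means the
> // result does not depend on the order in which records arrive.
> PCollection<Long> hash = values
>     .apply("Element hashes", MapElements.via(
>         new SimpleFunction<String, Long>() {
>           @Override
>           public Long apply(String s) {
>             return (long) s.hashCode();
>           }
>         }))
>     .apply("Combine", Sum.longsGlobally());
> {code}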
>  
> During the tests I ran into two problems:
> 1. When all the records are read from the Kafka topic, the hash is 
> different each time.
> 2. Sometimes not all the records are read, and the Dataflow job waits for 
> input indefinitely, occasionally throwing exceptions.
>  
> I believe there are two possible causes of this behavior: either there is 
> something wrong with the Kafka cluster configuration, or KafkaIO behaves 
> erratically at high data volumes, duplicating and/or dropping records.
> The second option seems troubling, and I would be grateful for help with 
> the first.
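> As a diagnostic (not part of the current test), comparing the total and 
> distinct counts of the decoded values would tell duplication apart from 
> dropped records:
> {code:java}
> // total > distinct    => duplicates were read;
> // distinct < 100M     => records were dropped.
> PCollection<Long> total = values.apply("Total", Count.globally());
> PCollection<Long> distinct = values
>     .apply("Dedup", Distinct.create())
>     .apply("Distinct count", Count.globally());
> {code}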



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
