Abacn opened a new issue, #27022: URL: https://github.com/apache/beam/issues/27022
### What happened?

Reported from https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/759

When implementing a load test for BigTableIO, we encountered the following:

- Load tests up to 200 MB pass stably.
- After 5 million records, not all data gets into Bigtable, even though the pipeline logs indicate that all data was written.

The Dataflow write pipeline logs say that 10M records were written, yet the read job shows only 1.6M records read. Counting rows with the `cbt` utility (`cbt -instance <instance id> count <table id>`) confirmed that the BigTableIO write did not work correctly: although the logs claim all 10M records were written, the table contained exactly as many records as the read pipeline processed (1.6M). Some of the records processed by the write pipeline never reached the table. A minimal sketch of the kind of write pipeline this load test exercises is included at the end of this report.

- Dataflow write pipeline job: `2023-06-05_03_51_23-9051905355392445711`
- Dataflow read pipeline job: `2023-06-05_03_58_18-7016807525741705033`

Project: `apache-beam-testing`

### Issue Priority

Priority: 1 (data loss / total loss of function)

### Issue Components

- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [X] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
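For reproduction context, here is a minimal sketch of the kind of `BigtableIO` write pipeline such a load test exercises. This is not the actual test code from the DataflowTemplates PR: the instance and table IDs, the column family `cf`, the key format, and the payload are placeholders, and the 10M-record source is synthesized with `GenerateSequence`.

```java
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.extensions.protobuf.ByteStringCoder;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptor;

public class BigtableWriteLoadTest {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Generate10MRecords", GenerateSequence.from(0).to(10_000_000L))
        // Turn each sequence number into a (row key, mutations) pair,
        // the element type BigtableIO.write() expects.
        .apply("ToMutations",
            MapElements
                .into(new TypeDescriptor<KV<ByteString, Iterable<Mutation>>>() {})
                .via(i -> KV.of(
                    ByteString.copyFromUtf8(String.format("key-%09d", i)),
                    Collections.singletonList(
                        Mutation.newBuilder()
                            .setSetCell(Mutation.SetCell.newBuilder()
                                .setFamilyName("cf") // placeholder column family
                                .setColumnQualifier(ByteString.copyFromUtf8("value"))
                                .setValue(ByteString.copyFromUtf8("payload-" + i)))
                            .build()))))
        // Explicit coder, since KV<ByteString, Iterable<Mutation>> may not be inferred.
        .setCoder(KvCoder.of(ByteStringCoder.of(),
            IterableCoder.of(ProtoCoder.of(Mutation.class))))
        .apply("WriteToBigtable",
            BigtableIO.write()
                .withProjectId("apache-beam-testing") // project named in the report
                .withInstanceId("<instance id>")      // placeholder
                .withTableId("<table id>"));          // placeholder

    p.run().waitUntilFinish();
  }
}
```

After such a write job reports success, the row count can be checked independently with `cbt -instance <instance id> count <table id>`, which is how the 1.6M vs. 10M discrepancy described above was observed.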
