ahmedabu98 opened a new issue, #28632:
URL: https://github.com/apache/beam/issues/28632

   ### What happened?
   
   When users don't explicitly set a timestamp on their records, the Python 
Bigtable client defaults the timestamp to -1, which tells Bigtable to use its 
system time at ingestion. The connector mishandles these rows: it drops the 
-1 timestamps instead of sending them over 
[here](https://github.com/apache/beam/blob/f635ade0e70a8c347d977f6dbd425ba0c4df37d0/sdks/python/apache_beam/io/gcp/bigtableio.py#L258-L259).
 When the records reach the underlying Java IO, it sees no timestamp. 
Unlike the Python client, the Java Bigtable client defaults timestamps to 0 
(epoch time). 
   
   As a result, each cell is written with epoch time instead of the current 
timestamp.
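
The mismatch described above can be sketched in plain Python. This is a toy model rather than the connector's actual code: `SetCell`, `encode_buggy`, `encode_forwarding`, `receive`, and the `timestamp_micros` key are hypothetical names chosen for illustration, and forwarding the sentinel is only one possible way to address the problem.

```python
from dataclasses import dataclass

SERVER_TIME = -1  # sentinel: let Bigtable assign its system time at ingestion
JAVA_DEFAULT = 0  # default the issue reports for an absent timestamp (epoch)


@dataclass
class SetCell:
    """Hypothetical stand-in for a cell mutation; not the real client class."""
    family: str
    qualifier: bytes
    value: bytes
    timestamp_micros: int = SERVER_TIME  # unset timestamps default to -1


def encode_buggy(cell: SetCell) -> dict:
    """Mimics the reported behavior: the -1 sentinel is dropped, not sent."""
    out = {"family": cell.family, "qualifier": cell.qualifier, "value": cell.value}
    if cell.timestamp_micros != SERVER_TIME:
        out["timestamp_micros"] = cell.timestamp_micros
    return out


def encode_forwarding(cell: SetCell) -> dict:
    """One possible fix (an assumption): always forward the timestamp."""
    out = encode_buggy(cell)
    out["timestamp_micros"] = cell.timestamp_micros
    return out


def receive(encoded: dict) -> int:
    """Mimics the receiving side, which falls back to 0 when nothing arrives."""
    return encoded.get("timestamp_micros", JAVA_DEFAULT)
```

In this model, `receive(encode_buggy(SetCell("cf", b"q", b"v")))` yields 0 (epoch time), while `receive(encode_forwarding(...))` preserves the -1 sentinel for the same cell.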
   
   This can affect users in two ways:
   1. Users can set a garbage collection policy that cleans up old records in 
their table. Records with unset timestamps will appear to be very old 
(1970-01-01) and will be garbage collected.
   2. Bigtable keeps the history of a cell in a table. When users write to a 
cell multiple times, this bug will cause the cell history to be overwritten 
because the same timestamp (epoch time) is used each time.
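
Both effects listed above can be reproduced with a small in-memory model. This is a simplified sketch, assuming cell versions are identified by row, column, and timestamp and that garbage collection uses a plain maximum-age rule; `FakeTable` and its methods are hypothetical and do not correspond to any Bigtable client API.

```python
class FakeTable:
    """Toy model: each cell version is keyed by (row, column, timestamp)."""

    def __init__(self):
        self.cells = {}

    def write(self, row, column, value, timestamp_micros):
        # Writing with an existing timestamp replaces that version.
        self.cells[(row, column, timestamp_micros)] = value

    def versions(self, row, column):
        # Return (timestamp, value) pairs for one cell, newest first.
        found = [
            (ts, value)
            for (r, c, ts), value in self.cells.items()
            if r == row and c == column
        ]
        return sorted(found, reverse=True)

    def collect_garbage(self, now_micros, max_age_micros):
        # Simplified max-age policy: drop versions older than the cutoff.
        cutoff = now_micros - max_age_micros
        self.cells = {k: v for k, v in self.cells.items() if k[2] >= cutoff}


table = FakeTable()
table.write("row1", "cf:col", b"first", 0)   # epoch time
table.write("row1", "cf:col", b"second", 0)  # epoch time again
history = table.versions("row1", "cf:col")
```

Here `history` holds a single version because the second write replaced the first. Calling `collect_garbage` with any recent `now_micros` and a realistic maximum age would then remove that remaining epoch-timestamped version as well.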
   
   ### Issue Priority
   
   Priority: 1 (data loss / total loss of function)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [X] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

