VardhanThigle opened a new issue, #29832:
URL: https://github.com/apache/beam/issues/29832

   ### What happened?
   
   # Race condition detected in `SpannerTransactionWriterDoFn` between the 
setup and teardown phases, leading to missing records during Spanner migration.
   
   ## Root cause:
   The root cause is that the `spanner` object is closed immediately after it 
is created, due to a race condition in the 
[SpannerAccessor](third_party/java_src/apache_beam4g/srcs/sdks/io/google_cloud_platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerAccessor.java)
 class.
   
   ### Race Condition:
   1. `SpannerAccessor.getOrCreate` gets called during the `@Setup` of the 
[SpannerTransactionWriterDoFn](https://source.corp.google.com/piper///depot/google3/third_party/java_src/cloud/teleport/v2/datastream-to-spanner/src/main/java/com/google/cloud/teleport/v2/templates/SpannerTransactionWriterDoFn.java;l=176?q=spanner%20close&ss=piper%2FGoogle%2FPiper:google3%2Fthird_party%2Fjava_src%2Fcloud%2Fteleport%2Fv2%2F)
 and `SpannerAccessor.close` gets called during its `@Teardown`.
   2. These functions use a combination of synchronized access and atomic 
reference counts to share the Spanner connections for a given Spanner config. 
(In the case of SMT there is a single Spanner config and hence a single connection.)
   3. When `SpannerAccessor.getOrCreate` provisions a new connection, it:
   
       3.1. Queries the accessor from a concurrent hashmap.
   
       3.2. If not found, takes the lock and provisions the connection.
   
       3.3. Sets the refcount to 0.
   
       3.4. Releases the lock.
   
       3.5. Increments the refcount outside the lock.
   
   4. When `SpannerAccessor.close` destroys a connection, it:
   
       4.1. Atomically decrements the refcount.
   
       4.2. Checks whether the refcount is 0.
   
       4.3. If so, takes the lock.
   
       4.4. Rechecks the refcount.
   
       4.5. Destroys the connection iff the refcount during the second check is 
<= 0.
   
   5. If the logs show anything other than a single connect followed by a 
single close, it's an indication of a race condition (since SMT has only one 
Spanner connection profile).
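   
The steps in 3 and 4 can be modelled with a small, single-threaded replay. This is only a sketch of one interleaving that seems consistent with the description above, not the actual `SpannerAccessor` code: the class `RefCountRaceSketch`, the `Conn` type, and all method names are hypothetical stand-ins, and the "threads" are replayed sequentially so the outcome is deterministic. It assumes a second worker obtains the cached accessor (3.1) and is paused before its increment (3.5) while the first worker tears down.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the pattern described in steps 3 and 4; not Beam code.
class RefCountRaceSketch {
    /** Stand-in for a shared Spanner connection. */
    static final class Conn {
        final AtomicInteger refcount = new AtomicInteger(0); // step 3.3: starts at 0
        volatile boolean closed = false;
    }

    static final ConcurrentHashMap<String, Conn> CACHE = new ConcurrentHashMap<>();
    static final Object LOCK = new Object();

    // Steps 3.1 to 3.4: look up the accessor, provisioning it under the lock if absent.
    static Conn lookupOrProvision(String config) {
        Conn c = CACHE.get(config);
        if (c == null) {
            synchronized (LOCK) {
                c = CACHE.get(config);
                if (c == null) {
                    c = new Conn();
                    CACHE.put(config, c);
                }
            }
        }
        return c;
    }

    // Step 3.5: the increment happens after the lock has been released.
    static void incrementOutsideLock(Conn c) {
        c.refcount.incrementAndGet();
    }

    // Steps 4.1 to 4.5: decrement, then destroy under the lock if the count is still <= 0.
    static void close(String config, Conn c) {
        if (c.refcount.decrementAndGet() == 0) {
            synchronized (LOCK) {
                if (c.refcount.get() <= 0) {
                    CACHE.remove(config);
                    c.closed = true;
                }
            }
        }
    }

    /** Replays one problematic interleaving; returns true if B ends up holding a closed connection. */
    static boolean replayRace() {
        CACHE.clear();
        String config = "cfg";
        Conn a = lookupOrProvision(config); // worker A: setup, steps 3.1 to 3.4
        incrementOutsideLock(a);            // worker A: step 3.5, refcount is now 1
        Conn b = lookupOrProvision(config); // worker B: setup finds the same accessor...
        close(config, a);                   // ...but A tears down before B's step 3.5
        incrementOutsideLock(b);            // worker B: refcount is 1 again, too late
        return b.closed;
    }

    public static void main(String[] args) {
        System.out.println("connection closed while still referenced: " + replayRace());
    }
}
```

In this replay, worker B finishes setup holding a connection that has already been destroyed, which is the kind of window that opens when the increment is not covered by the same lock as the recheck in step 4.4.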
   
   
   ### Issue Priority
   
   Priority: 1 (data loss / total loss of function)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
