Re: [PR] Enable configuration of a CDC mutation info Callable for CDC Writes into BigQuery [beam]

via GitHub Mon, 02 Dec 2024 11:16:07 -0800


damccorm commented on code in PR #32878:
URL: https://github.com/apache/beam/pull/32878#discussion_r1866473483



##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2550,7 +2577,7 @@ def __init__(
       use_at_least_once=False,
       with_auto_sharding=False,
       num_storage_api_streams=0,
-      use_cdc_writes: bool = False,
+      use_cdc_writes: UseCdcWrites = False,

Review Comment:
   > Currently, a SDK user would need to know how to structure their Rows to be 
ingested using CDC or, if they use Dicts as their data format, their provided 
schema should include the row mutation information making it not matching with 
the actual BigQuery table schema they want to write to (otherwise the xlang 
protocol wouldn't work).
   
   Isn't this still true with your change? The only difference is that users
   now encapsulate the conversion logic in their Write transform instead of
   adding a separate step beforehand. As I understand it, the following two
   blocks are equivalent (and not particularly different to write):
   
   ```
   pcoll
   | WriteToBigQuery(..., use_cdc_writes=my_fn, ...)
   ```
   
   and:
   
   ```
   pcoll
   | beam.Map(my_fn)
   | WriteToBigQuery(..., use_cdc_writes=True, ...)
   ```
   
   The main difference is that the second is, IMO, easier to debug. Am I right
   that these are equivalent in all cases, or am I missing something?
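
To make the comparison concrete, here is a minimal plain-Python sketch with no Beam dependency. It only models my understanding of the two forms and is not the actual `WriteToBigQuery` implementation. `fake_write` is a hypothetical stand-in for the sink, `my_fn` is a hypothetical conversion function, and the `row_mutation_info` / `record` layout and the sequence-number format are assumptions about what the CDC sink expects.

```python
from typing import Any, Callable, Dict, Iterable, List, Union

Record = Dict[str, Any]
# Assumption: stands in for the PR's `UseCdcWrites` type (a bool or a callable).
UseCdcWrites = Union[bool, Callable[[Record], Record]]


def my_fn(record: Record) -> Record:
    """Hypothetical conversion: wrap a plain record with row mutation info."""
    return {
        "row_mutation_info": {
            "mutation_type": "UPSERT",
            # Assumed format; here just the record's version rendered as hex.
            "change_sequence_number": format(record["version"], "x"),
        },
        "record": record,
    }


def fake_write(rows: Iterable[Record],
               use_cdc_writes: UseCdcWrites = False) -> List[Record]:
    """Hypothetical sink: returns the rows that would be sent downstream."""
    if callable(use_cdc_writes):
        # Form 1: the sink applies the conversion itself.
        return [use_cdc_writes(row) for row in rows]
    # Form 2 (or CDC disabled): rows are forwarded as they are.
    return list(rows)


rows = [{"id": 1, "version": 2}, {"id": 2, "version": 10}]

# Form 1: pass the callable to the write step.
form_1 = fake_write(rows, use_cdc_writes=my_fn)

# Form 2: convert in an explicit step first, then write with a plain flag.
converted = [my_fn(row) for row in rows]
form_2 = fake_write(converted, use_cdc_writes=True)

print(form_1 == form_2)
```

In the second form the intermediate `converted` collection can be inspected or tested on its own, which is the debuggability point above.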
   
   > this change brings parity with BigQueryIO
   
   This is a good reason to consider the change, but IMO not enough on its own to add it.
   
   > We have similar overloads in this same file, see [here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L972) and also we have other instances on where the overload is silently added on a [typeless argument](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L2571).
   
   To be honest, I don't love that usage either. However, there are a few
   differences there:
   
   1) It is not a user-facing API.
   2) `table` is a more general argument name than `use_cdc_writes` (which is
   no longer descriptive with this change).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
