damccorm commented on code in PR #32878:
URL: https://github.com/apache/beam/pull/32878#discussion_r1866473483
##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2550,7 +2577,7 @@ def __init__(
use_at_least_once=False,
with_auto_sharding=False,
num_storage_api_streams=0,
- use_cdc_writes: bool = False,
+ use_cdc_writes: UseCdcWrites = False,
Review Comment:
> Currently, a SDK user would need to know how to structure their Rows to be
ingested using CDC or, if they use Dicts as their data format, their provided
schema should include the row mutation information making it not matching with
the actual BigQuery table schema they want to write to (otherwise the xlang
protocol wouldn't work).
Isn't this still true with your change? The only difference is that they're
now encapsulating the conversion logic in their Write transform instead of
having a step to do it beforehand. Basically, as I understand it, the following
2 blocks are equivalent (and not particularly different to write):
```
pcoll
| WriteToBigQuery(..., use_cdc_writes=my_fn, ...)
```
and:
```
pcoll
| beam.Map(my_fn)
| WriteToBigQuery(..., use_cdc_writes=True, ...)
```
the main difference is that the 2nd is easier to debug IMO. Am I right that
those are equivalent in all cases or am I missing something?
> this change brings parity with BigQueryIO
This is a good reason to consider the change, but not enough to add it IMO
> We have similar overloads in this same file, see
[here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L972)
and also we have other instances on where the overload is silently added on a
[typeless
argument](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L2571).
To be honest, I don't love that usage either. However, there are a few
differences there:
1) It is not a user facing API
2) `table` is a more general arg than `use_cdc_writes` (which is no longer
descriptive with this change)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]