satybald edited a comment on pull request #15185: URL: https://github.com/apache/beam/pull/15185#issuecomment-906714140
> With that said, if your proposal is that we should provide a single, centrally-maintained compat layer that customers can make use of, that makes sense to me. The DATETIME issue is particularly galling as an end user -- let's start there

My proposal here is to strive for simplicity and user adoption. I don't see the point of having one public ReadFromBigQuery (RFBQ) API that parses `DATETIME`, or any other type, in three different ways. From the customer's perspective this is a nightmare, given that RFBQ is usually the source of the pipeline. A user who decides to adopt BQ Storage Read instead of Avro export might need to test and modify the pipeline in several places.

> Seems like you've run into issues even before this PR was merged

I tested this PR early and planned to switch some of our pipelines, but after reading the docs further I found that migrating an existing pipeline from `EXPORT` to `DIRECT_READ` would be a pretty hard task. Here are the quotes from the documentation that describe the issue:

> .. warning::
> DATETIME columns are parsed as strings in the fastavro library. As a
> result, such columns will be converted to Python strings instead of native
> Python DATETIME type

https://github.com/apache/beam/blob/7cad244c2c668cb92d84c5e9b951a0dbffae5017/sdks/python/apache_beam/io/gcp/bigquery.py#L2132

> When using JSON exports, the BigQuery types for DATE, DATETIME, TIME, and
> TIMESTAMP will be exported as strings. This behavior is consistent with
> BigQuerySource.
> When using Avro exports, these fields will be exported as native Python
> types (datetime.date, datetime.datetime, datetime.datetime,
> and datetime.datetime respectively). Avro exports are recommended.

https://github.com/apache/beam/blob/7cad244c2c668cb92d84c5e9b951a0dbffae5017/sdks/python/apache_beam/io/gcp/bigquery.py#L2192

My shallow understanding of the situation is that the issue lies mainly in how `python-bigquery-storage` parses `DATETIME` (see [1]), rather than in `fastavro`.
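
To make the compat-layer idea concrete, here is a minimal sketch of what a user-side shim could look like today. It is not part of Beam: `normalize_datetime` and `normalize_row` are hypothetical helper names, and the sketch assumes that any string-valued `DATETIME` arrives in an ISO 8601-like form that `datetime.datetime.fromisoformat` accepts.

```python
import datetime
from typing import Any, Dict, Iterable


def normalize_datetime(value: Any) -> Any:
    """Return a datetime.datetime for a DATETIME value, however it was read."""
    # Already native (or NULL): nothing to do.
    if value is None or isinstance(value, datetime.datetime):
        return value
    if isinstance(value, str):
        # Assumption: the string is ISO 8601-like, e.g. "2021-08-26T12:34:56".
        return datetime.datetime.fromisoformat(value)
    raise TypeError(f"Unsupported DATETIME value: {value!r}")


def normalize_row(row: Dict[str, Any],
                  datetime_fields: Iterable[str]) -> Dict[str, Any]:
    """Copy a row dict, normalizing the named DATETIME columns."""
    out = dict(row)
    for name in datetime_fields:
        if name in out:
            out[name] = normalize_datetime(out[name])
    return out
```

Something like this could be applied right after the read step, for example with `beam.Map(normalize_row, datetime_fields=["created_at"])` (the column name is made up), so that downstream transforms see the same type regardless of the read method. Having every user write this by hand is exactly the duplication a centrally maintained layer would avoid.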
[1] https://github.com/googleapis/python-bigquery-storage/blob/master/google/cloud/bigquery_storage_v1/reader.py#L660-L661
