satybald commented on pull request #15185:
URL: https://github.com/apache/beam/pull/15185#issuecomment-906626617
I would be +1 to revert this PR and work further on cleaning up the public
API rather than exposing it to the user in the current format. Once the API
will be exposed it will be hard/impossible to modify and keep backward
compatible mode.
My current concern is that ReadFromBigQuery is already quite messy and
complicated to use, the current PR just adds to it.
## What’s the issues with ReadFromBQ from the customer perspective?
I believe as a community we should strive to provide an easy and intuitive
interface to reading BQ data in BEAM. Thus, we should make API comptable as
much as possible to minimize any downstream work. Ideally scenario, the user
just needs to specify a query or table to read without providing other
parameters that s/he shouldn’t be aware of or care about.
The main issue(besides not able to support query parameters, and exposing
unnecessary parameters) is type inconsistent parsing between different storage
formats. For example, this is the current format of parsing DATETIME and
TIMESTAMP
**Batch with JSON**
DATETIME, TIMESTAMP - all strings
**Batch export with AVRO**
DATETIME, TIMESTAMP - all native python datetime objects
**BQ Storage API**
DATETIME - string, TIMESTAMP - native python object
Ideally as the user when reading from BQ, I should just need to pick which
method is the best suited to read data from BQ i.e. BATCH, DIRECT_READ. The
user shouldn’t be exposed with the burden to modify the pipeline further.
It might be fine if you just start writing a pipeline or have 1-2 pipelines
but we talk about a large number of pipelines that a typical company might
have(50-200) the burden of modifying each of them becomes not worth the effort
and pretty much impossible.
If adoption of the read BQ storage API is in mind, I would encourage to
think from the user perspective -to make a switch to a new method as easy as
possible. In the end, BATCH AVRO method also uses fastavro to read the AVRO
stream. Thus, those two methods of reading data should be compatible rather
than having a bunch of warnings for type incompatibility and adding a migration
burder to the user.
cc: @vachan-shetty @chamikaramj @pabloem @tvalentyn @kmjung
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]