[GitHub] [beam] satybald commented on pull request #15185: [BEAM-10917] Add support for BigQuery Read API in Python BEAM

GitBox Thu, 26 Aug 2021 11:05:08 -0700


satybald commented on pull request #15185:
URL: https://github.com/apache/beam/pull/15185#issuecomment-906626617



   I would be +1 to revert this PR and work further on cleaning up the public 
API rather than exposing it to the user in the current format. Once the API 
will be exposed it will be hard/impossible to modify and keep backward 
compatible mode.
   
   
   My current concern is that ReadFromBigQuery is already quite messy and 
complicated to use, the current PR just adds to it.
   
   ## What’s the issues with ReadFromBQ from the customer perspective?
   
   I believe as a community we should strive to provide an easy and intuitive 
interface to reading BQ data in BEAM. Thus, we should make API comptable as 
much as possible to minimize any downstream work. Ideally scenario, the user 
just needs to specify a query or table to read without providing other 
parameters that s/he shouldn’t be aware of or care about.
   
   The main issue(besides not able to support query parameters, and exposing 
unnecessary parameters) is type inconsistent parsing between different storage 
formats. For example, this is the current format of parsing DATETIME and 
TIMESTAMP
   
   **Batch with JSON**
   DATETIME, TIMESTAMP - all strings
   
   **Batch export with AVRO**
   DATETIME, TIMESTAMP - all native python datetime objects
   
   **BQ Storage API**
   DATETIME - string, TIMESTAMP - native python object
   
   Ideally as the user when reading from BQ, I should just need to pick which 
method is the best suited to read data from BQ i.e. BATCH, DIRECT_READ. The 
user shouldn’t be exposed with the burden to modify the pipeline further.  
   
   It might be fine if you just start writing a pipeline or have 1-2 pipelines 
but we talk about a large number of pipelines that a typical company might 
have(50-200) the burden of modifying each of them becomes not worth the effort 
and pretty much impossible. 
    
   If adoption of the read BQ storage API is in mind, I would encourage to 
think from the user perspective -to make a switch to a new method as easy as 
possible. In the end, BATCH AVRO method also uses fastavro to read the AVRO 
stream. Thus, those two methods of reading data should be compatible rather 
than having a bunch of warnings for type incompatibility and adding a migration 
burder to the user.  
   
   cc: @vachan-shetty @chamikaramj @pabloem @tvalentyn @kmjung 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] satybald commented on pull request #15185: [BEAM-10917] Add support for BigQuery Read API in Python BEAM

Reply via email to