[
https://issues.apache.org/jira/browse/BEAM-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Halperin updated BEAM-48:
--------------------------------
Assignee: Pei He
> BigQueryIO.Read reimplemented as BoundedSource
> ----------------------------------------------
>
> Key: BEAM-48
> URL: https://issues.apache.org/jira/browse/BEAM-48
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-gcp
> Reporter: Daniel Halperin
> Assignee: Pei He
>
> BigQueryIO.Read is currently implemented in a hacky way: the
> DirectPipelineRunner streams all rows in the table or query result directly
> using the JSON API, in a single-threaded manner.
> In contrast, the DataflowPipelineRunner uses an entirely different code path
> implemented in the Google Cloud Dataflow service: a BigQuery export job to
> GCS, followed by a parallel read from GCS.
> We need to reimplement BigQueryIO as a BoundedSource in order to support
> other runners in a scalable way.
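To illustrate the parallelism a BoundedSource would enable (a hypothetical sketch only, not the actual Beam BoundedSource API or its signatures): the essential idea is that the source describes how to split itself into bundles that a runner can read concurrently. Splitting an exported table of n rows into k roughly equal contiguous shards might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class ShardSplitter {
    /** A half-open row range [start, end) standing in for one readable bundle. */
    public record Shard(long start, long end) {}

    /** Split [0, totalRows) into at most desiredShards contiguous shards. */
    public static List<Shard> split(long totalRows, int desiredShards) {
        List<Shard> shards = new ArrayList<>();
        if (totalRows <= 0 || desiredShards <= 0) {
            return shards;
        }
        // Ceiling division so the last shard absorbs any remainder.
        long shardSize = (totalRows + desiredShards - 1) / desiredShards;
        for (long start = 0; start < totalRows; start += shardSize) {
            shards.add(new Shard(start, Math.min(start + shardSize, totalRows)));
        }
        return shards;
    }

    public static void main(String[] args) {
        // 10 rows into 3 shards: [0,4), [4,8), [8,10)
        System.out.println(split(10, 3));
    }
}
```

Each shard could then be handed to a separate worker, which is what makes the export-then-parallel-read path scale where the single-threaded JSON API path does not.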
> I additionally suggest that we revisit the design of the BigQueryIO source in
> the process. A short list:
> * Do not use TableRow as the default value for rows. It could be Map<String,
> Object> with well-defined types, for example, or an Avro GenericRecord.
> Dropping TableRow will get around a variety of issues with types, fields
> named 'f', etc., and it will also reduce confusion as we use TableRow objects
> differently than usual (for good reason).
> * We could also directly add support for a RowParser that converts rows into
> a user's POJO.
> * We should expose TableSchema as a side output from the BigQueryIO.Read.
> * Our builders for BigQueryIO.Read are useful and we should keep them. Where
> possible we should also allow users to provide the JSON objects that
> configure the underlying intermediate tables, query export, etc. This would
> let users directly control result flattening, location of intermediate
> tables, table decorators, etc., and also optimistically let users take
> advantage of some new BigQuery features without code changes.
> * We could switch between a BigQuery export + parallel scan and a direct API
> read based on factors such as the size of the table at pipeline construction
> time.
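The first two proposals above can be sketched together. This is purely illustrative (the RowParser interface and all names here are hypothetical, not an existing Beam API): instead of handing users TableRow, the read could produce Map<String, Object> rows with well-defined types and apply a user-supplied parser to build their own POJO.

```java
import java.util.Map;

public class RowParserDemo {
    /** Hypothetical user-supplied function from a generic row to a domain object. */
    public interface RowParser<T> {
        T parse(Map<String, Object> row);
    }

    /** Example POJO a user might want rows converted into. */
    public record User(String name, long visits) {}

    public static void main(String[] args) {
        // The user wires up the conversion once; the source applies it per row.
        RowParser<User> parser =
                row -> new User((String) row.get("name"), (Long) row.get("visits"));
        User u = parser.parse(Map.of("name", "ada", "visits", 42L));
        System.out.println(u); // User[name=ada, visits=42]
    }
}
```

Because the parser sees plain Java types rather than TableRow, it sidesteps the type quirks and the confusion around fields named 'f' mentioned above.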
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)