[
https://issues.apache.org/jira/browse/BEAM-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Halperin updated BEAM-48:
--------------------------------
Assignee: Pei He
> BigQueryIO.Read reimplemented as BoundedSource
> ----------------------------------------------
>
> Key: BEAM-48
> URL: https://issues.apache.org/jira/browse/BEAM-48
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-gcp
> Reporter: Daniel Halperin
> Assignee: Pei He
>
> BigQueryIO.Read is currently implemented in a hacky way: the
> DirectPipelineRunner streams all rows in the table or query result directly
> using the JSON API, in a single-threaded manner.
> In contrast, the DataflowPipelineRunner uses an entirely different code path
> implemented in the Google Cloud Dataflow service: a BigQuery export job to
> GCS, followed by a parallel read from GCS.
> We need to reimplement BigQueryIO as a BoundedSource in order to support
> other runners in a scalable way.
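To illustrate the parallelism a BoundedSource would enable (a hypothetical sketch only, not the actual Beam BoundedSource API or its signatures): the essential idea is that the source describes how to split itself into bundles that a runner can read concurrently. Splitting an exported table of n rows into k roughly equal contiguous shards might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class ShardSplitter {
    /** A half-open row range [start, end) standing in for one readable bundle. */
    public record Shard(long start, long end) {}

    /** Split [0, totalRows) into at most desiredShards contiguous shards. */
    public static List<Shard> split(long totalRows, int desiredShards) {
        List<Shard> shards = new ArrayList<>();
        if (totalRows <= 0 || desiredShards <= 0) {
            return shards;
        }
        // Ceiling division so the last shard absorbs any remainder.
        long shardSize = (totalRows + desiredShards - 1) / desiredShards;
        for (long start = 0; start < totalRows; start += shardSize) {
            shards.add(new Shard(start, Math.min(start + shardSize, totalRows)));
        }
        return shards;
    }

    public static void main(String[] args) {
        // 10 rows into 3 shards: [0,4), [4,8), [8,10)
        System.out.println(split(10, 3));
    }
}
```

Each shard could then be handed to a separate worker, which is what makes the export-then-parallel-read path scale where the single-threaded JSON API path does not.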
> I additionally suggest that we revisit the design of the BigQueryIO source in
> the process. A short list:
> * Do not use TableRow as the default value for rows. It could be Map<String,
> Object> with well-defined types, for example, or an Avro GenericRecord.
> Dropping TableRow will get around a variety of issues with types, fields
> named 'f', etc., and it will also reduce confusion as we use TableRow objects
> differently than usual (for good reason).
> * We could also directly add support for a RowParser that converts rows into
> a user's POJO.
> * We should expose TableSchema as a side output from the BigQueryIO.Read.
> * Our builders for BigQueryIO.Read are useful and we should keep them. Where
> possible we should also allow users to provide the JSON objects that
> configure the underlying intermediate tables, query export, etc. This would
> let users directly control result flattening, location of intermediate
> tables, table decorators, etc., and also optimistically let users take
> advantage of some new BigQuery features without code changes.
> * We could switch between a BigQuery export + parallel scan and a direct API
> read based on factors such as the size of the table at pipeline construction
> time.
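The first two proposals above can be sketched together. This is purely illustrative (the RowParser interface and all names here are hypothetical, not an existing Beam API): instead of handing users TableRow, the read could produce Map<String, Object> rows with well-defined types and apply a user-supplied parser to build their own POJO.

```java
import java.util.Map;

public class RowParserDemo {
    /** Hypothetical user-supplied function from a generic row to a domain object. */
    public interface RowParser<T> {
        T parse(Map<String, Object> row);
    }

    /** Example POJO a user might want rows converted into. */
    public record User(String name, long visits) {}

    public static void main(String[] args) {
        // The user wires up the conversion once; the source applies it per row.
        RowParser<User> parser =
                row -> new User((String) row.get("name"), (Long) row.get("visits"));
        User u = parser.parse(Map.of("name", "ada", "visits", 42L));
        System.out.println(u); // User[name=ada, visits=42]
    }
}
```

Because the parser sees plain Java types rather than TableRow, it sidesteps the type quirks and the confusion around fields named 'f' mentioned above.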
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)