Re: Reading from BigQuery on portable runners in Python SDK

Chamikara Jayalath Tue, 01 Oct 2019 17:21:47 -0700

Yes, this is something we have wanted to do for some time but could not
prioritize due to other high-priority work. The JIRA is
https://issues.apache.org/jira/browse/BEAM-1440.


Note that BigQuery sources have many moving parts, and the Java BigQuery
source [1] is one of the most complicated sources we have. So I suggest
following the Java implementation closely when implementing the Python
version.
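
For orientation, here is a rough sketch of the export-then-split shape
described in the quoted message below (export the table to JSON files, then
hand back one file-based source per exported file from split()). It is plain
Python and does not import Beam: FileSource, Bundle, ExportingTableSource,
and the export_fn/open_lines callables are hypothetical stand-ins used only
to show the control flow, not Beam or BigQuery APIs, and the in-memory
"files" replace a real export job.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class FileSource:
    """Hypothetical stand-in for a per-file text source (not a Beam class)."""
    path: str
    size_bytes: int
    open_lines: Callable[[str], Iterator[str]]

    def read(self) -> Iterator[dict]:
        # Each non-empty line of an exported file is one JSON-encoded row.
        for line in self.open_lines(self.path):
            if line.strip():
                yield json.loads(line)


@dataclass
class Bundle:
    """Hypothetical stand-in for a (weight, source) pair returned by split()."""
    weight: int
    source: FileSource


class ExportingTableSource:
    """Sketch only: export once, then delegate reading to per-file sources."""

    def __init__(self, table: str,
                 export_fn: Callable[[str], Iterable[Tuple[str, int]]],
                 open_lines: Callable[[str], Iterator[str]]):
        self.table = table
        self._export_fn = export_fn    # assumed to return (path, size) pairs
        self._open_lines = open_lines  # assumed to yield the lines of a file
        self._exported: Optional[List[Tuple[str, int]]] = None

    def _export_once(self) -> List[Tuple[str, int]]:
        # Cache the result so estimate_size() and split() share one export.
        if self._exported is None:
            self._exported = list(self._export_fn(self.table))
        return self._exported

    def estimate_size(self) -> int:
        return sum(size for _, size in self._export_once())

    def split(self) -> Iterator[Bundle]:
        # One bundle per exported file; a real source could split further.
        for path, size in self._export_once():
            yield Bundle(size, FileSource(path, size, self._open_lines))


# In-memory stand-ins for the exported files (assumption: no real export).
FAKE_FILES: Dict[str, List[str]] = {
    "export-000.json": ['{"id": 1}\n', '{"id": 2}\n'],
    "export-001.json": ['{"id": 3}\n'],
}


def fake_export(table: str) -> List[Tuple[str, int]]:
    return [(p, sum(len(l) for l in ls)) for p, ls in FAKE_FILES.items()]


def fake_open(path: str) -> Iterator[str]:
    return iter(FAKE_FILES[path])


source = ExportingTableSource("project:dataset.table", fake_export, fake_open)
bundles = list(source.split())
rows = [row for bundle in bundles for row in bundle.source.read()]
```

The pieces this leaves out (export job management, schema handling,
temporary file cleanup, range tracking within a file) are a large part of
what makes the real source complicated.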

Another option is to wait until Splittable DoFn (SDF) is available for
Python bounded sources, which is expected soon. Waiting is not strictly
necessary, though, since we'll be providing converters from BoundedSources
to SDF (pure SDF versions will probably be better in some regards).

Thanks,
Cham


[1]
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L546

On Tue, Oct 1, 2019 at 8:48 AM Ahmet Altay <al...@google.com> wrote:

> +Chamikara Jayalath <chamik...@google.com> and +Pablo Estrada
> <pabl...@google.com> might have ideas related to this.
>
> On Tue, Oct 1, 2019 at 2:39 AM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
>
>> If anyone is interested, here is a link to my code:
>> https://github.com/kamilwu/beam/tree/bounded-source-for-bq
>>
>> On Tue, Oct 1, 2019 at 11:17 AM Kamil Wasilewski <
>> kamil.wasilew...@polidea.com> wrote:
>>
>>> Hi all,
>>>
>>> At the moment, we have a BigQuery native source for Python SDK, which
>>> can be used only by Dataflow runner. Consequently, it doesn't work on
>>> portable runners, such as Flink.
>>>
>>> Recently I have written a prototypical source which implements
>>> iobase.BoundedSource, so that other runners can read from BigQuery as well.
>>> It works the same way as in Java SDK [1], which means that it exports
>>> BigQuery table to JSON and returns TextSource objects in the split() call.
>>> However, it has the following problems:
>>> - it doesn't work on Direct runner,
>>>
>>
> I believe DirectRunner already has an implementation for reading from BQ.
>
>
>> - its API is highly experimental.
>>>
>>
> Which API is highly experimental?
>
>
>>
>>> This is where my question begins. What should we do in order to provide
>>> support for reading from BigQuery on other runners than Dataflow? Do you
>>> think it's fine to continue working on the source I described? Or maybe it
>>> should be done in an entirely different way (not by exporting tables to
>>> JSON)?
>>>
>>> Thanks,
>>> Kamil
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java
>>>
>>
