[
https://issues.apache.org/jira/browse/BEAM-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kamil Wasilewski updated BEAM-8528:
-----------------------------------
Description:
Refer to https://github.com/apache/beam/pull/9772 for more information and the
context of this ticket.
The following exception is raised when the _ReadFromBigQuery_ PTransform is
used on DirectRunner in the Python SDK:
{code:java}
  File "/home/Kamil/projects/beam/sdks/python/apache_beam/io/gcp/bigquery.py", line 639, in get_range_tracker
    raise NotImplementedError('BigQuery source must be split before being read')
NotImplementedError: BigQuery source must be split before being read
{code}
The direct cause is that the _get_range_tracker_ and _read_ methods are
intentionally not implemented in __BigQuerySource_; the runner is expected to
call _split_ before reading. The Java implementation works the same way:
[link|https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java]
It seems that DataflowRunner and Flink are able to catch these exceptions
somehow, while DirectRunner is not.
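To illustrate the failure mode without depending on Beam, here is a minimal sketch of a source that must be split before it is read. It is not the Beam implementation: _SplitOnlySource_, _ListSource_, _read_directly_ and _read_via_split_ are hypothetical names, and the in-memory rows stand in for BigQuery data. The assumption is that a runner which reads the top-level source directly hits the error, while a runner which calls _split_ first does not.

```python
class ListSource:
    """Hypothetical readable sub-source holding a chunk of rows."""

    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return iter(self.rows)


class SplitOnlySource:
    """Hypothetical stand-in for a source that only supports split()."""

    def __init__(self, rows, bundle_size=2):
        self.rows = list(rows)
        self.bundle_size = bundle_size

    def split(self):
        # Yield readable sub-sources, one per chunk of rows.
        for start in range(0, len(self.rows), self.bundle_size):
            yield ListSource(self.rows[start:start + self.bundle_size])

    def get_range_tracker(self):
        # Deliberately unimplemented, mirroring the message in the traceback.
        raise NotImplementedError('source must be split before being read')

    def read(self):
        raise NotImplementedError('source must be split before being read')


def read_directly(source):
    # A runner that skips split() ends up in the unimplemented read().
    return list(source.read())


def read_via_split(source):
    # A runner that splits first only ever reads the sub-sources.
    return [row for sub in source.split() for row in sub.read()]
```

With this sketch, _read_via_split_ returns all rows in order, whereas _read_directly_ raises NotImplementedError, which matches the behaviour reported above for DirectRunner.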
was:
{code:java}
File "/home/Kamil/projects/beam/sdks/python/apache_beam/io/gcp/bigquery.py",
line 639, in get_range_tracker
raise NotImplementedError('BigQuery source must be split before being read')
NotImplementedError: BigQuery source must be split before being read
{code}
_get_range_tracker_ and _read_ methods aren't implemented in __BigQuerySource_.
This is purposeful — the runner is expected to call _split_ instead. The Java
implementation works the same way:
[link|https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java]
It seems that DataflowRunner and Flink are able to catch these exceptions
somehow, while DirectRunner is not.
> BigQuery bounded source does not work on DirectRunner
> -----------------------------------------------------
>
> Key: BEAM-8528
> URL: https://issues.apache.org/jira/browse/BEAM-8528
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Kamil Wasilewski
> Priority: Major