[
https://issues.apache.org/jira/browse/BEAM-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Graham Polley updated BEAM-6910:
--------------------------------
Description:
When using the BigQuery source with a query in a pipeline, the "processing
location" is not taken into consideration and the pipeline fails.
For example, consider the following pipeline, which uses `BigQuerySource` to
read from BigQuery with a SQL query. The BigQuery dataset and tables are located
in "australia-southeast1". The query job is submitted successfully ([Beam works
out the processing location by examining the first table referenced in the query
and sets it
accordingly|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L221]).
However, when Beam polls for the job status after submission, the request fails
because it does not set `location` to "australia-southeast1", which BigQuery
requires:
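{code:python}
import apache_beam as beam

# Read via a query; the referenced dataset lives in "australia-southeast1".
p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
    use_standard_sql=True,
    query='SELECT * FROM `a_project_id.dataset_in_australia.table_in_australia`'))
{code}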
{code:java}
HttpNotFoundError: HttpError accessing
<https://www.googleapis.com/bigquery/v2/projects/a_project_id/queries/5ad9cc803baa432290b6cd0203f556d9?alt=json&maxResults=10000>:
response: <{'status': '404', 'content-length': '328', 'x-xss-protection': '1;
mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding':
'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF',
'-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Tue, 26 Mar
2019 03:11:32 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443";
ma=2592000; v="46,44,43,39"', 'content-type': 'application/json;
charset=UTF-8'}>, content <{
 "error": {
  "code": 404,
  "message": "Not found: Job a_project_id:5ad9cc803baa432290b6cd0203f556d9",
  "errors": [
   {
    "message": "Not found: Job a_project_id:5ad9cc803baa432290b6cd0203f556d9",
    "domain": "global",
    "reason": "notFound"
   }
  ],
  "status": "NOT_FOUND"
 }
}
{code}
The problem can be seen/found here:
[https://github.com/apache/beam/blob/v2.11.0/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L571]
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L357]
The location of the job (in this case "australia-southeast1") needs to be
set or inferred (or exposed via the API); otherwise the request fails.
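To make the missing parameter concrete, here is a minimal sketch of the polling URL with and without `location`. It is not Beam's code: `query_results_url` is a hypothetical helper, and it only assumes that the BigQuery REST endpoint shown in the traceback accepts a `location` query parameter, as the description says BigQuery requires.

```python
from urllib.parse import urlencode

BIGQUERY_BASE_URL = "https://www.googleapis.com/bigquery/v2"


def query_results_url(project_id, job_id, location=None, max_results=10000):
    """Build the URL used to poll a query job (hypothetical helper, not Beam code)."""
    params = {"alt": "json", "maxResults": max_results}
    if location is not None:
        # Assumption: BigQuery looks the job up in this location; the
        # traceback above shows what happens when it is left out.
        params["location"] = location
    return "{}/projects/{}/queries/{}?{}".format(
        BIGQUERY_BASE_URL, project_id, job_id, urlencode(params))


JOB_ID = "5ad9cc803baa432290b6cd0203f556d9"

# Same shape as the request in the traceback: no location, answered with 404.
failing = query_results_url("a_project_id", JOB_ID)

# The same request with the processing location carried over from submission.
fixed = query_results_url("a_project_id", JOB_ID, location="australia-southeast1")
print(fixed)
```

The point of the sketch is only that the location chosen when the job is submitted has to travel with every later request about that job.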
For reference, Airflow had the same bug/problem:
[https://github.com/apache/airflow/pull/4695]
> Beam does not consider BigQuery's processing location when getting query
> results
> --------------------------------------------------------------------------------
>
> Key: BEAM-6910
> URL: https://issues.apache.org/jira/browse/BEAM-6910
> Project: Beam
> Issue Type: Bug
> Components: dependencies, runner-dataflow, sdk-py-core
> Affects Versions: 2.11.0
> Environment: Python
> Reporter: Graham Polley
> Priority: Major