[
https://issues.apache.org/jira/browse/BEAM-10647?focusedWorklogId=467579&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467579
]
ASF GitHub Bot logged work on BEAM-10647:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Aug/20 21:31
Start Date: 06/Aug/20 21:31
Worklog Time Spent: 10m
Work Description: galuszkak commented on a change in pull request #12480:
URL: https://github.com/apache/beam/pull/12480#discussion_r466696088
##########
File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py
##########
@@ -318,15 +319,20 @@ def get_query_location(self, project_id, query,
use_legacy_sql):
referenced_tables = response.statistics.query.referencedTables
if referenced_tables: # Guards against both non-empty and non-None
Review comment:
@apilloud comment suggest that is also guarding against None value. If
that's true, dropping this if, would break code `for loop` in case
referenced_tables is actually None.
Therefore my question is, are you sure we want to remove that if?
##########
File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py
##########
@@ -318,15 +319,20 @@ def get_query_location(self, project_id, query,
use_legacy_sql):
referenced_tables = response.statistics.query.referencedTables
if referenced_tables: # Guards against both non-empty and non-None
- table = referenced_tables[0]
- location = self.get_table_location(
- table.projectId, table.datasetId, table.tableId)
- _LOGGER.info(
- "Using location %r from table %r referenced by query %s",
- location,
- table,
- query)
- return location
+ for table in referenced_tables:
+ try:
+ location = self.get_table_location(
+ table.projectId, table.datasetId, table.tableId)
+ except HttpForbiddenError:
+ # Permission access for table (i.e. from authorized_view),
+ # try next one
+ continue
+ _LOGGER.info(
+ "Using location %r from table %r referenced by query %s",
+ location,
+ table,
+ query)
+ return location
_LOGGER.debug("Query %s does not reference any tables.", query)
Review comment:
@apilloud will change it. Thanks for pointing this out.
##########
File path: sdks/python/apache_beam/io/gcp/bigquery_tools_test.py
##########
@@ -50,6 +49,15 @@
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
+# Protect against environments where bigquery library is not available.
Review comment:
@apilloud this is copy from bigquery_test.py I assumed we want this
consistent between files.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 467579)
Time Spent: 40m (was: 0.5h)
> BigQueryIO BigQueryWrapper.get_query_location can end up in permission issue
> ----------------------------------------------------------------------------
>
> Key: BEAM-10647
> URL: https://issues.apache.org/jira/browse/BEAM-10647
> Project: Beam
> Issue Type: Bug
> Components: io-py-gcp
> Reporter: Kamil Gałuszka
> Priority: P2
> Time Spent: 40m
> Remaining Estimate: 0h
>
> This bug is not deterministic because of Google BigQuery API, but let me try
> to describe the problem, as we were hunting down this for whole 2 days.
> So imagine that you have one dataset with table XYZ. You added to that
> dataset Authorized View that is referencing table in project that you don't
> have access to. Only via Authorized View you can query that table.
> Unfortunately when executing method
> {code:java}
> `get_query_location`{code}
> To determine location where to write temp_dataset:
> {code:java}
> referenced_tables = response.statistics.query.referencedTables
> if referenced_tables: # Guards against both non-empty and non-None
> table = referenced_tables[0]
> location = self.get_table_location( table.projectId, table.datasetId,
> table.tableId)
> {code}
> The issue with that code is that, referenced_tables, will not reference
> where view is but it will give you information about underlying table in that
> authorised view.
> So if it would be first in your result (and implementation of
> get_query_location only cares about first result), you will get permission
> error, that you cannot retrieve dataset which is correct! User has access to
> Authorised view, that he can query, but not to underlying table.
> Therefore, what should happen, implementation should be changed, to loops
> through tables until it finds location.
> Mainly my point boils down to:
> * You can get table, that you don't have access and it's dataset, but you
> can query it via Authorised Views.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)