This is an automated email from the ASF dual-hosted git repository.
yhu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 98cef8bb617 [GCP] [BigQuery] Handle `totalBytesProcessed` `NoneType`
(#27474)
98cef8bb617 is described below
commit 98cef8bb6178c4bc18a0db5bae7eac1c1ea449d6
Author: Dan Hansen <[email protected]>
AuthorDate: Fri Jul 21 09:04:17 2023 -0700
[GCP] [BigQuery] Handle `totalBytesProcessed` `NoneType` (#27474)
* [GCP] [BigQuery] Handle totalBytesProcessed NoneType
* Update CHANGES.md
* lint / whitespace
---------
Co-authored-by: Yi Hu <[email protected]>
---
CHANGES.md | 2 ++
sdks/python/apache_beam/io/gcp/bigquery.py | 26 ++++++++++++++++++++++----
2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/CHANGES.md b/CHANGES.md
index dff482b6b74..ec1c112ff4d 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -79,6 +79,8 @@
## Bugfixes
+* Fixed DirectRunner bug in Python SDK where GroupByKey gets empty PCollection
and fails when pipeline option `direct_num_workers!=1`.
([#27373](https://github.com/apache/beam/pull/27373))
+* Fixed BigQuery I/O bug when estimating size on queries that utilize
row-level security ([#27474](https://github.com/apache/beam/pull/27474))
* Fixed X (Java/Python) ([#X](https://github.com/apache/beam/issues/X)).
## Known Issues
diff --git a/sdks/python/apache_beam/io/gcp/bigquery.py
b/sdks/python/apache_beam/io/gcp/bigquery.py
index 3fc7bfc3b02..5c1ca4a7d6e 100644
--- a/sdks/python/apache_beam/io/gcp/bigquery.py
+++ b/sdks/python/apache_beam/io/gcp/bigquery.py
@@ -751,8 +751,17 @@ class _CustomBigQuerySource(BoundedSource):
kms_key=self.kms_key,
job_labels=self._get_bq_metadata().add_additional_bq_job_labels(
self.bigquery_job_labels))
- size = int(job.statistics.totalBytesProcessed)
- return size
+
+ if job.statistics.totalBytesProcessed is None:
+ # Some queries may not have access to `totalBytesProcessed` as a
+ # result of row-level security.
+ # > BigQuery hides sensitive statistics on all queries against
+ # > tables with row-level security.
+ # See cloud.google.com/bigquery/docs/managing-row-level-security
+ # and cloud.google.com/bigquery/docs/best-practices-row-level-security
+ return None
+
+ return int(job.statistics.totalBytesProcessed)
else:
# Size estimation is best effort. We return None as we have
# no access to the query that we're running.
@@ -1104,8 +1113,17 @@ class _CustomBigQueryStorageSource(BoundedSource):
kms_key=self.kms_key,
job_labels=self._get_bq_metadata().add_additional_bq_job_labels(
self.bigquery_job_labels))
- size = int(job.statistics.totalBytesProcessed)
- return size
+
+ if job.statistics.totalBytesProcessed is None:
+ # Some queries may not have access to `totalBytesProcessed` as a
+ # result of row-level security
+ # > BigQuery hides sensitive statistics on all queries against
+ # > tables with row-level security.
+ # See cloud.google.com/bigquery/docs/managing-row-level-security
+ # and cloud.google.com/bigquery/docs/best-practices-row-level-security
+ return None
+
+ return int(job.statistics.totalBytesProcessed)
else:
# Size estimation is best effort. We return None as we have
# no access to the query that we're running.