Robert Jany created BEAM-12879:
----------------------------------
Summary: Downloading GCS objects suddenly requires
storage.buckets.get permission
Key: BEAM-12879
URL: https://issues.apache.org/jira/browse/BEAM-12879
Project: Beam
Issue Type: Bug
Components: io-py-gcp
Affects Versions: 2.32.0
Reporter: Robert Jany
With PR [https://github.com/apache/beam/pull/14770], downloading GCS objects
requires an additional IAM permission, `storage.buckets.get`, to get the
project_number based on the bucket name.
If the service account or user does not have this permission, the following
error is shown:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 651, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 54, in dataflow_worker.native_operations.NativeReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 353, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 712, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 713, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1234, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 571, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/usr/local/lib/python3.7/site-packages/brain_picker/transformation/cdd_resume.py", line 112, in process
    with FileSystems.open(element["gcs_uri"]) as file:
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystems.py", line 244, in open
    return filesystem.open(path, mime_type, compression_type)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 177, in open
    return self._path_open(path, 'rb', mime_type, compression_type)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 138, in _path_open
    raw_file = gcsio.GcsIO().open(path, mode, mime_type=mime_type)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 227, in open
    get_project_number=self.get_project_number)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in __init__
    project_number = self._get_project_number(self._bucket)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 166, in get_project_number
    self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
AttributeError: 'NoneType' object has no attribute 'projectNumber' [while running 'read from GCS']
```
The error message does not hint at what exactly goes wrong, but after some
digging my assumption is that the request for the `bucket_metadata` in
[get_project_number|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L161]
fails with an HTTP error due to the lack of permissions. Since a None is
returned when this error is caught, `bucket_metadata` ends up being None.
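To illustrate the suspected failure mode in isolation, here is a minimal
sketch that does not use Beam at all. `FakeHttpError`, the `allowed` flag and
the value 12345 are hypothetical stand-ins; only the pattern (HTTP error
swallowed, None returned, attribute accessed without a check) is taken from
the description above.

```python
from types import SimpleNamespace


class FakeHttpError(Exception):
    """Hypothetical stand-in for the HTTP error raised on a 403 response."""


def get_bucket(bucket_name, allowed):
    """Return bucket metadata, or None if the request fails."""
    try:
        if not allowed:
            # Simulates a caller without the storage.buckets.get permission.
            raise FakeHttpError("403: storage.buckets.get denied")
        return SimpleNamespace(name=bucket_name, projectNumber=12345)
    except FakeHttpError:
        # The error is swallowed here, so the caller only ever sees None.
        return None


def get_project_number(bucket_name, allowed):
    bucket_metadata = get_bucket(bucket_name, allowed)
    # No None check: this line raises AttributeError when access was denied.
    return bucket_metadata.projectNumber
```

Calling `get_project_number("some-bucket", allowed=False)` in this sketch
raises the same kind of AttributeError as in the traceback, with no mention of
the missing permission.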
The problem is that the required permission (`storage.buckets.get`) is only
covered by the predefined role `Storage Admin (roles/storage.admin)`, which I
believe shouldn't be necessary in order to access objects in GCS.
I am not sure what the solution would look like: we want the metadata,
including the project number, but on the other hand it seems excessive to have
to grant Storage Admin (or to create custom roles) in order to work with GCS
objects. In any case, this situation needs a more informative error message.
[get_project_number|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L161]
should handle the situation of getting a None from
[get_bucket|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L176]
more gracefully than failing on an AttributeError as seen above.
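One possible shape for such handling, as a rough sketch only and not the
actual Beam code: check for None and raise an error that names the missing
permission. `BucketMetadataUnavailableError`, `resolve_project_number` and the
injected `get_bucket` callable are hypothetical names for illustration.
Whether to raise or to continue without a project number is a design decision
this sketch does not settle.

```python
from types import SimpleNamespace


class BucketMetadataUnavailableError(Exception):
    """Hypothetical error for bucket metadata that could not be fetched."""


def resolve_project_number(bucket, get_bucket, cache):
    """Look up and cache the project number for a bucket.

    `get_bucket` is any callable returning an object with a `projectNumber`
    attribute, or None when the metadata request failed.
    """
    if bucket in cache:
        return cache[bucket]
    bucket_metadata = get_bucket(bucket)
    if bucket_metadata is None:
        # Name the likely cause instead of failing on None.projectNumber.
        raise BucketMetadataUnavailableError(
            "Could not read metadata for bucket %r. The caller may be "
            "missing the storage.buckets.get permission." % bucket)
    cache[bucket] = bucket_metadata.projectNumber
    return cache[bucket]


def denied(bucket):
    """Simulates a metadata request that failed, e.g. with a 403."""
    return None


def granted(bucket):
    """Simulates a successful metadata request with a made-up number."""
    return SimpleNamespace(projectNumber=42)
```

With `denied` the lookup fails with a message that points at the permission;
with `granted` the number is returned and stored in the cache dictionary.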
Note: This issue probably does not occur only in the Python SDK, but I believe
I checked the Java implementation for this, and at least there we should be
getting a more precise error.
First issue, don't eat me alive :)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)