[
https://issues.apache.org/jira/browse/BEAM-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466200#comment-17466200
]
Ahmet Altay commented on BEAM-12879:
------------------------------------
[~tvalentyn] - Assigned this to you to find an owner. I do not think it is very
urgent, but it is definitely a getting-started and ease-of-use issue. A simple
fix may also be possible: ignore the errors for unauthenticated users.
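
A minimal sketch of the "ignore the errors" idea, not the actual Beam change:
the lookup treats a failed bucket metadata request as "project number unknown"
and returns None instead of raising. `ProjectNumberCache`, the injected
`get_bucket` callable, and `fake_get_bucket` are hypothetical stand-ins for the
`GcsIO` logic shown in the traceback below; only the `projectNumber` attribute
and the None-on-HTTP-error behaviour are taken from the report.

```python
from types import SimpleNamespace
from typing import Callable, Dict, Optional


class ProjectNumberCache:
    """Hypothetical stand-in for the bucket -> project number lookup."""

    def __init__(self, get_bucket: Callable[[str], Optional[object]]):
        # Assumption: get_bucket returns None when the metadata request
        # fails, e.g. because the caller lacks storage.buckets.get.
        self._get_bucket = get_bucket
        self._cache: Dict[str, object] = {}

    def get_project_number(self, bucket: str):
        if bucket not in self._cache:
            metadata = self._get_bucket(bucket)
            if metadata is None:
                # Ignore the failure: the project number stays unknown,
                # but reading objects can still proceed.
                return None
            self._cache[bucket] = metadata.projectNumber
        return self._cache[bucket]


def fake_get_bucket(bucket: str):
    """Test double: only 'readable-bucket' exposes its metadata."""
    if bucket == "readable-bucket":
        return SimpleNamespace(projectNumber=1234567890)
    return None


cache = ProjectNumberCache(fake_get_bucket)
known = cache.get_project_number("readable-bucket")
unknown = cache.get_project_number("restricted-bucket")
```

Failed lookups are deliberately not cached here, so a later call can succeed
once the permission has been granted.
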
> Downloading GCS objects suddenly require storage.buckets.get permission
> -----------------------------------------------------------------------
>
> Key: BEAM-12879
> URL: https://issues.apache.org/jira/browse/BEAM-12879
> Project: Beam
> Issue Type: Bug
> Components: io-py-gcp
> Affects Versions: 2.32.0
> Reporter: Robert Jany
> Assignee: Valentyn Tymofieiev
> Priority: P2
>
> With PR [https://github.com/apache/beam/pull/14770], downloading GCS objects
> requires an additional IAM permission, `storage.buckets.get`, to look up the
> project number from the bucket name.
> If the service account or user does not have this permission, the following
> error is raised:
> {code:python}
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 651, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
>     op.start()
>   File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 54, in dataflow_worker.native_operations.NativeReadOperation.start
>   File "apache_beam/runners/worker/operations.py", line 353, in apache_beam.runners.worker.operations.Operation.output
>   File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
>   File "apache_beam/runners/worker/operations.py", line 712, in apache_beam.runners.worker.operations.DoOperation.process
>   File "apache_beam/runners/worker/operations.py", line 713, in apache_beam.runners.worker.operations.DoOperation.process
>   File "apache_beam/runners/common.py", line 1234, in apache_beam.runners.common.DoFnRunner.process
>   File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
>   File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
>   File "apache_beam/runners/common.py", line 571, in apache_beam.runners.common.SimpleInvoker.invoke_process
>   File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
>   File "/usr/local/lib/python3.7/site-packages/xyz/package/file.py", line 112, in process
>     with FileSystems.open(element["gcs_uri"]) as file:
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystems.py", line 244, in open
>     return filesystem.open(path, mime_type, compression_type)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 177, in open
>     return self._path_open(path, 'rb', mime_type, compression_type)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 138, in _path_open
>     raw_file = gcsio.GcsIO().open(path, mode, mime_type=mime_type)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 227, in open
>     get_project_number=self.get_project_number)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in __init__
>     project_number = self._get_project_number(self._bucket)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 166, in get_project_number
>     self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
> AttributeError: 'NoneType' object has no attribute 'projectNumber' [while running 'read from GCS']
> {code}
>
> The error message does not hint at what exactly goes wrong, but after some
> digging my assumption is that the request for `bucket_metadata` in
> [get_project_number|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L161]
> fails with an HTTP error because of the missing permission. Since None is
> returned when this error is caught, `bucket_metadata` ends up being None.
> The problem is that the required permission (`storage.buckets.get`) is only
> covered by the predefined role `Storage Admin (roles/storage.admin)`, which I
> believe shouldn't be necessary in order to access objects in GCS.
> Not sure what the solution would look like: we want the metadata, including
> the project number, but on the other hand it seems excessive to have to grant
> Storage Admin (or to create custom roles) in order to work with GCS
> objects. In any case, this situation needs a more informative error message.
> [get_project_number|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L161]
> should handle a None returned by
> [get_bucket|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L176]
> more gracefully than failing with an AttributeError as seen above.
> Note: This issue probably does not affect only the Python SDK, but I believe
> I checked the Java implementation for this, and at least there we should be
> getting a more precise error.
> First issue, don't eat me alive :)
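
One way the "more elaborate error message" requested above could look, as a
sketch under stated assumptions rather than the fix that was merged: when the
bucket metadata is None, raise an error that names the bucket and the
permission that may be missing. `BucketMetadataUnavailableError` and
`project_number_from_metadata` are hypothetical names; only `projectNumber`
and `storage.buckets.get` come from the report.

```python
from types import SimpleNamespace


class BucketMetadataUnavailableError(RuntimeError):
    """Hypothetical error type for a failed bucket metadata lookup."""


def project_number_from_metadata(bucket, bucket_metadata):
    """Return the project number, or fail with an actionable message."""
    if bucket_metadata is None:
        # A None here only means the metadata request failed; a missing
        # permission is the likely cause, so the message says "may".
        raise BucketMetadataUnavailableError(
            "Could not read metadata for bucket %r. The caller may be "
            "missing the storage.buckets.get permission." % bucket)
    return bucket_metadata.projectNumber


# Metadata available: the project number is returned unchanged.
ok = project_number_from_metadata(
    "my-bucket", SimpleNamespace(projectNumber=42))

# Metadata unavailable: a descriptive error replaces the AttributeError.
try:
    project_number_from_metadata("my-bucket", None)
    message = None
except BucketMetadataUnavailableError as err:
    message = str(err)
```
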