[ 
https://issues.apache.org/jira/browse/BEAM-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466200#comment-17466200
 ] 

Ahmet Altay commented on BEAM-12879:
------------------------------------

[~tvalentyn] - Assigned this to find an owner. I do not think it is very 
urgent, but it is definitely a getting-started and ease-of-use issue. It may 
also have a simple fix: ignoring the errors for unauthenticated users.

> Downloading GCS objects suddenly require storage.buckets.get permission
> -----------------------------------------------------------------------
>
>                 Key: BEAM-12879
>                 URL: https://issues.apache.org/jira/browse/BEAM-12879
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>    Affects Versions: 2.32.0
>            Reporter: Robert Jany
>            Assignee: Valentyn Tymofieiev
>            Priority: P2
>
> With PR [https://github.com/apache/beam/pull/14770] downloading GCS objects 
> requires an additional IAM permission, `storage.buckets.get`, to look up the 
> project number based on the bucket name. 
> If the service account or user does not have this permission, the following 
> error will show:
> {code:python}
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 
> 651, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 179, in execute
>     op.start()
>   File "dataflow_worker/native_operations.py", line 38, in 
> dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 39, in 
> dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 44, in 
> dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 54, in 
> dataflow_worker.native_operations.NativeReadOperation.start
>   File "apache_beam/runners/worker/operations.py", line 353, in 
> apache_beam.runners.worker.operations.Operation.output
>   File "apache_beam/runners/worker/operations.py", line 215, in 
> apache_beam.runners.worker.operations.SingletonConsumerSet.receive
>   File "apache_beam/runners/worker/operations.py", line 712, in 
> apache_beam.runners.worker.operations.DoOperation.process
>   File "apache_beam/runners/worker/operations.py", line 713, in 
> apache_beam.runners.worker.operations.DoOperation.process
>   File "apache_beam/runners/common.py", line 1234, in 
> apache_beam.runners.common.DoFnRunner.process
>   File "apache_beam/runners/common.py", line 1315, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented
>   File "apache_beam/runners/common.py", line 1232, in 
> apache_beam.runners.common.DoFnRunner.process
>   File "apache_beam/runners/common.py", line 571, in 
> apache_beam.runners.common.SimpleInvoker.invoke_process
>   File "apache_beam/runners/common.py", line 1368, in 
> apache_beam.runners.common._OutputProcessor.process_outputs
>   File "/usr/local/lib/python3.7/site-packages/xyz/package/file.py", line 
> 112, in process
>     with FileSystems.open(element["gcs_uri"]) as file:
>   File 
> "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystems.py", line 
> 244, in open
>     return filesystem.open(path, mime_type, compression_type)
>   File 
> "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", 
> line 177, in open
>     return self._path_open(path, 'rb', mime_type, compression_type)
>   File 
> "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", 
> line 138, in _path_open
>     raw_file = gcsio.GcsIO().open(path, mode, mime_type=mime_type)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", 
> line 227, in open
>     get_project_number=self.get_project_number)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", 
> line 585, in __init__
>     project_number = self._get_project_number(self._bucket)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", 
> line 166, in get_project_number
>     self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
> AttributeError: 'NoneType' object has no attribute 'projectNumber' [while 
> running 'read from GCS']
> {code}
>  
> The error message does not hint at what exactly goes wrong, but after some 
> digging my assumption is that when trying to get the `bucket_metadata` in 
> [get_project_number|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L161]
>  we get an HTTP error due to the lack of permissions. Since None is returned 
> when this error is caught, `bucket_metadata` ends up being None.
> The problem is that the required permission (`storage.buckets.get`) is only 
> covered in the predefined role `Storage Admin (roles/storage.admin)`, which I 
> believe shouldn't be necessary in order to access objects from GCS.
> Not sure what the solution would look like: We want the metadata incl. the 
> project number but on the other hand it seems excessive to have to give 
> storage admin (or having to create custom roles) in order to work with GCS 
> objects. In any case this situation needs a more elaborate error message. 
> [get_project_number|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L161]
>  should handle the situation of getting a None from 
> [get_bucket|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L176]
>  more gracefully than failing with an AttributeError as seen above.
> Note: This issue will probably not only occur in the Python SDK, but I 
> believe I checked the Java implementation for this, and at least there 
> we should be getting a more precise error.
> First issue, don't eat me alive :)
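
The graceful handling asked for in the description could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual Beam code or the eventual fix: the standalone function signature, the `get_bucket` callable, the `cache` dict and the warning text are all hypothetical. The only behaviour taken from the report is that the bucket lookup returns None when the HTTP request fails and that the metadata exposes a `projectNumber` attribute.

```python
import logging
from types import SimpleNamespace
from typing import Callable, Dict, Optional

_LOGGER = logging.getLogger(__name__)


def get_project_number(
    bucket: str,
    get_bucket: Callable[[str], Optional[object]],
    cache: Dict[str, Optional[int]],
) -> Optional[int]:
    """Return the project number for `bucket`, or None if it cannot be read.

    Hypothetical helper: `get_bucket` stands in for the lookup described in
    the report, which returns None when the HTTP request fails.
    """
    if bucket not in cache:
        bucket_metadata = get_bucket(bucket)
        if bucket_metadata is None:
            # The lookup failed, for example because the caller lacks
            # storage.buckets.get. Warn instead of raising AttributeError.
            _LOGGER.warning(
                "Could not read metadata for bucket %s; the caller may lack "
                "the storage.buckets.get permission. Continuing without a "
                "project number.",
                bucket,
            )
            cache[bucket] = None
        else:
            cache[bucket] = bucket_metadata.projectNumber
    return cache[bucket]


# Usage with stand-in lookups (no GCS access involved).
cache: Dict[str, Optional[int]] = {}
denied = get_project_number("private-bucket", lambda name: None, cache)
allowed = get_project_number(
    "open-bucket", lambda name: SimpleNamespace(projectNumber=12345), cache
)
```

Caching the None result avoids repeating a lookup that is expected to fail again; whether callers can really proceed without a project number is an open question that this sketch does not answer.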



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
