[ 
https://issues.apache.org/jira/browse/BEAM-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Jany updated BEAM-12879:
-------------------------------
    Description: 
With PR [https://github.com/apache/beam/pull/14770] downloading GCS objects 
requires an additional IAM permission, `storage.buckets.get`, to get the 
project_number based on the bucket name. 

If the service account or user does not have this permission, the following 
error is shown:
{code:python}
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 651, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 54, in dataflow_worker.native_operations.NativeReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 353, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 712, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 713, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1234, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 571, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/usr/local/lib/python3.7/site-packages/xyz/package/file.py", line 112, in process
    with FileSystems.open(element["gcs_uri"]) as file:
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystems.py", line 244, in open
    return filesystem.open(path, mime_type, compression_type)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 177, in open
    return self._path_open(path, 'rb', mime_type, compression_type)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 138, in _path_open
    raw_file = gcsio.GcsIO().open(path, mode, mime_type=mime_type)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 227, in open
    get_project_number=self.get_project_number)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in __init__
    project_number = self._get_project_number(self._bucket)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 166, in get_project_number
    self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
AttributeError: 'NoneType' object has no attribute 'projectNumber' [while running 'read from GCS']
{code}
 

The error message does not hint at what exactly goes wrong, but after some 
digging my assumption is that when trying to get `bucket_metadata` in 
[get_project_number|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L161]
 we get an HTTP error due to the lack of permissions. Since None is returned 
when this error is caught, `bucket_metadata` ends up being None.
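
To illustrate the failure path I suspect, here is a minimal sketch. It does not use Beam's actual code: `HttpError`, `FakeGcsIO` and `_fetch_bucket` are hypothetical stand-ins that only mimic the control flow described above (the HTTP error is swallowed, None is returned, and the attribute access then fails).

```python
class HttpError(Exception):
    """Hypothetical stand-in for the HTTP error raised by the storage client."""


class FakeGcsIO:
    """Hypothetical stand-in that mimics only the suspected control flow."""

    def __init__(self):
        self.bucket_to_project_number = {}

    def _fetch_bucket(self, bucket):
        # Simulates the request being rejected for a caller that lacks
        # the storage.buckets.get permission.
        raise HttpError("403 Forbidden: storage.buckets.get denied on " + bucket)

    def get_bucket(self, bucket):
        try:
            return self._fetch_bucket(bucket)
        except HttpError:
            # The error is swallowed here, so the caller only ever sees None.
            return None

    def get_project_number(self, bucket):
        if bucket not in self.bucket_to_project_number:
            bucket_metadata = self.get_bucket(bucket)
            # With bucket_metadata being None, this attribute access fails.
            self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
        return self.bucket_to_project_number[bucket]


def reproduce():
    """Return the AttributeError message, or None if no error occurred."""
    try:
        FakeGcsIO().get_project_number("my-bucket")
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `reproduce()` returns the text of an AttributeError about `projectNumber`, and the original permission problem is no longer visible anywhere in it.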

The problem is that the required permission (`storage.buckets.get`) is only 
covered by the predefined role `Storage Admin (roles/storage.admin)`, which I 
believe shouldn't be necessary in order to access objects from GCS.

Not sure what the solution would look like: we want the metadata, including 
the project number, but on the other hand it seems excessive to have to grant 
Storage Admin (or to create custom roles) in order to work with GCS objects. 
In any case this situation needs a more informative error message. 
[get_project_number|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L161]
 should handle getting a None from 
[get_bucket|https://github.com/roger-mike/beam/blob/f0d0dd561a0955afb73cf595a3015a7ca839d5b7/sdks/python/apache_beam/io/gcp/gcsio.py#L176]
 more gracefully than failing with an AttributeError as seen above.
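
One possible shape for such handling, as a rough sketch only: the function name `resolve_project_number`, its parameters and the warning text are my own assumptions, not Beam's API. It takes the lookup as a callable that may return None and falls back to "no project number" with a log message that names the likely cause.

```python
import logging
from types import SimpleNamespace


def resolve_project_number(bucket, get_bucket, cache):
    """Return the project number for `bucket`, or None if it is unavailable.

    `get_bucket` is any callable that returns the bucket metadata, or None
    when the lookup failed (for example because of missing permissions).
    `cache` is a dict mapping bucket names to project numbers.
    All three names are hypothetical and chosen for this sketch.
    """
    if bucket in cache:
        return cache[bucket]

    bucket_metadata = get_bucket(bucket)
    if bucket_metadata is None:
        # Name the likely cause instead of failing on attribute access.
        logging.warning(
            "Could not read metadata for bucket %s; the caller may lack the "
            "storage.buckets.get permission. Continuing without a project "
            "number.", bucket)
        return None

    cache[bucket] = bucket_metadata.projectNumber
    return cache[bucket]


def example_usage():
    """Show both outcomes using in-memory stand-ins for the lookup."""
    cache = {}
    missing = resolve_project_number("no-access", lambda b: None, cache)
    found = resolve_project_number(
        "ok", lambda b: SimpleNamespace(projectNumber=123), cache)
    return missing, found, cache
```

Whether to continue without the project number or to raise an error that names the missing permission is a design decision I can't make here; the sketch only shows that either is preferable to the bare AttributeError.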

Note: This issue probably does not occur only in the Python SDK, but I believe 
I checked the Java implementation for this, and at least there we should be 
getting a more precise error.

First issue, don't eat me alive :)


> Downloading GCS objects suddenly require storage.buckets.get permission
> -----------------------------------------------------------------------
>
>                 Key: BEAM-12879
>                 URL: https://issues.apache.org/jira/browse/BEAM-12879
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>    Affects Versions: 2.32.0
>            Reporter: Robert Jany
>            Priority: P2


