lgeiger commented on issue #28398:
URL: https://github.com/apache/beam/issues/28398#issuecomment-1923661050

   @BjornPrime It would be great if the number of GET requests in GCSIO
could be reduced.
   
   In particular, the code often seems to call `client.get_bucket()` or
`client.lookup_bucket()`, each of which issues an extra GET request compared to
just using `client.bucket()` or `storage.Blob.from_string()` where possible.
Often this doubles the number of requests needed to, e.g., read a blob, which is
significantly slower.
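
   To illustrate the difference, here is a minimal sketch (not the actual GCSIO
code). The two helper names are hypothetical; `client` is assumed to be a
`google.cloud.storage.Client`, and only `Client.bucket()`,
`Client.get_bucket()` and `Bucket.blob()` are used:

```python
def blob_handle_without_lookup(client, bucket_name, blob_name):
    """Build a blob handle purely locally.

    `client.bucket()` and `Bucket.blob()` only construct objects; no request
    is sent until the blob is actually read or written.
    """
    return client.bucket(bucket_name).blob(blob_name)


def blob_handle_with_lookup(client, bucket_name, blob_name):
    """Build the same handle, but fetch the bucket metadata first.

    `client.get_bucket()` sends a GET for the bucket, which costs one extra
    round trip and needs the `storage.buckets.get` permission.
    """
    return client.get_bucket(bucket_name).blob(blob_name)
```

   With a real client, something like
`blob_handle_without_lookup(storage.Client(), "some-bucket", "some/object").download_as_bytes()`
should then need only the object download request (bucket and object names
here are placeholders).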
   
   Furthermore, `get_bucket` and `lookup_bucket` require the
`storage.buckets.get` permission. This is not part of the "Storage Object
Viewer" role but only available in the "Storage Object Admin" role, which also
grants write permission. This means people running in environments that only
allow read access to certain buckets will not be able to use the new GCSIO
implementation to read blobs. This broke one of our pipelines when we tried to
upgrade from 2.52 to 2.53.
   
   As far as I can tell switching from `get_bucket` to `bucket` is also 
recommended in the 
[docs](https://cloud.google.com/storage/docs/downloading-objects#storage-download-object-python),
 so not sure where the permission error that you mention is coming from.
   
   Out of interest, are there any advantages to using Beam's built-in GCSIO
over a generic implementation like
[fsspec](https://filesystem-spec.readthedocs.io/en/latest/) in a pipeline?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
