lgeiger commented on issue #28398: URL: https://github.com/apache/beam/issues/28398#issuecomment-1923661050
@BjornPrime It would be great if the number of GET requests in GCSIO could be reduced. In particular, the code often calls `client.get_bucket()` or `client.lookup_bucket()`, each of which makes an unnecessary GET request compared to just using `client.bucket()` or `storage.Blob.from_string()` where possible. Often this doubles the number of requests needed to, e.g., read a blob, which is significantly slower.

Furthermore, `get_bucket` and `lookup_bucket` require the `storage.buckets.get` permission. This is not part of the "Storage Object Viewer" role but only available in the "Storage Object Admin" role, which also grants write permission. This means people running in environments that only allow read access to certain buckets will not be able to use the new GCSIO implementation to read blobs. This broke one of our pipelines when trying to upgrade from 2.52 to 2.53.

As far as I can tell, switching from `get_bucket` to `bucket` is also recommended in the [docs](https://cloud.google.com/storage/docs/downloading-objects#storage-download-object-python), so I'm not sure where the permission error that you mention is coming from.

Out of interest, are there any advantages of using Beam's built-in GCSIO over a generic implementation like [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) in a pipeline?
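To illustrate the difference between the two call patterns, here is a minimal sketch. The helper names `read_with_bucket_lookup` and `read_without_bucket_lookup` are hypothetical and not taken from GCSIO; `client` is assumed to be a `google.cloud.storage.Client` (or any object exposing the same `get_bucket`, `bucket`, `blob` and `download_as_bytes` methods).

```python
def read_with_bucket_lookup(client, bucket_name, blob_name):
    """Fetch the bucket's metadata first, then download the object.

    get_bucket() issues its own GET request and needs the
    storage.buckets.get permission on top of object read access.
    """
    bucket = client.get_bucket(bucket_name)  # extra API request
    return bucket.blob(blob_name).download_as_bytes()


def read_without_bucket_lookup(client, bucket_name, blob_name):
    """Build the bucket handle locally, then download the object.

    bucket() only constructs a local Bucket object, so the download
    is the only request that reaches the API.
    """
    bucket = client.bucket(bucket_name)  # no API request
    return bucket.blob(blob_name).download_as_bytes()


# Example with a real client (assumes google-cloud-storage is installed
# and credentials are configured; the bucket and object names are placeholders):
#
#   from google.cloud import storage
#   data = read_without_bucket_lookup(storage.Client(), "my-bucket", "path/to/object")
```

With read-only object access, the first helper would be expected to fail at `get_bucket()`, while the second only needs permission to read the object itself.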
