Abacn commented on issue #20137:
URL: https://github.com/apache/beam/issues/20137#issuecomment-1194124960
@rviscomi From the source code you linked, it seems there is a two-stage
ReadAllFromText() when `input_file` is set. Nevertheless, on the Beam side,
there are things that could be optimized. When validating that at least
one file will be read:
https://github.com/apache/beam/blob/54b0784da7ccba738deff22bd83fbc374ad21d2e/sdks/python/apache_beam/io/filebasedsource.py#L187
the current gcsio will essentially try to read all files because it returns a
dict instead of using lazy evaluation:
https://github.com/apache/beam/blob/54b0784da7ccba738deff22bd83fbc374ad21d2e/sdks/python/apache_beam/io/gcp/gcsio.py#L611
which causes duplicate ops.
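
To illustrate the difference, here is a minimal sketch of an eager versus a lazy "at least one file" check. It does not use Beam or gcsio code: `FAKE_BUCKET`, `iter_objects`, `list_prefix_eager`, `has_match_eager` and `has_match_lazy` are hypothetical names made up for this example, and the in-memory list only stands in for a real bucket listing.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical stand-in for a bucket listing; not a Beam or GCS API.
FAKE_BUCKET = [("gs://bucket/data/part-%05d.txt" % i, 1024 + i)
               for i in range(1000)]

# Counts how many objects the fake listing has yielded so far.
listed = {"count": 0}


def iter_objects(prefix: str) -> Iterator[Tuple[str, int]]:
    """Lazily yield (path, size) pairs under `prefix`, one at a time."""
    for path, size in FAKE_BUCKET:
        if path.startswith(prefix):
            listed["count"] += 1
            yield path, size


def list_prefix_eager(prefix: str) -> Dict[str, int]:
    """Build the full {path: size} dict, consuming the whole listing."""
    return dict(iter_objects(prefix))


def has_match_eager(prefix: str) -> bool:
    """Validate by materializing every match, then checking the size."""
    return len(list_prefix_eager(prefix)) > 0


def has_match_lazy(prefix: str) -> bool:
    """Validate by pulling only the first match from the generator."""
    return next(iter_objects(prefix), None) is not None
```

With this toy listing, the eager check walks all 1000 entries before answering, while the lazy check stops after the first match. Whether the real listing can be made lazy in the same way depends on how the underlying client pages its results, which this sketch does not model.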