Chamikara Jayalath created BEAM-522:
---------------------------------------
Summary: Update FileSink.finalize_write() to be idempotent
Key: BEAM-522
URL: https://issues.apache.org/jira/browse/BEAM-522
Project: Beam
Issue Type: Bug
Components: sdk-py
Reporter: Chamikara Jayalath
Assignee: Chamikara Jayalath
Currently FileSink.finelize_write() in fileio.py [1] performs following
operations.
(1) Obtains a list of temporary files as a side input
(2) Renames each temporary file to the location where final output should be
stored.
iobase.Sink.finalize_write() operation should be idempotent since runner
implementations may call this operation multiple times due to task failures.
Current implementation is not idempotent because if we re-run the operation
after renaming a sub-set of files, the operations may fail due to not being
able to find some files at source location (for example, [2] for GCS files).
We can fix this by checking if the destination file is already available before
performing the rename and not performing the rename for files that are already
available at the destination.
[1]
https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L503
[2]
https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L187
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)