[jira] [Commented] (BEAM-522) Update FileSink.finalize_write() to be idempotent
[ https://issues.apache.org/jira/browse/BEAM-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415985#comment-15415985 ]

ASF GitHub Bot commented on BEAM-522:
-------------------------------------

GitHub user asfgit closed the pull request at:

    https://github.com/apache/incubator-beam/pull/779

> Update FileSink.finalize_write() to be idempotent
> -------------------------------------------------
>
>                 Key: BEAM-522
>                 URL: https://issues.apache.org/jira/browse/BEAM-522
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Chamikara Jayalath
>            Assignee: Chamikara Jayalath
>
> Currently, FileSink.finalize_write() in fileio.py [1] performs the following
> operations:
> (1) Obtains a list of temporary files as a side input.
> (2) Renames each temporary file to the location where the final output should
> be stored.
> The iobase.Sink.finalize_write() operation should be idempotent, since runner
> implementations may call it multiple times due to task failures.
> The current implementation is not idempotent: if the operation is re-run after
> a subset of the files has been renamed, it may fail because some files can no
> longer be found at the source location (for example, [2] for GCS files).
> We can fix this by checking whether the destination file already exists before
> performing the rename, and skipping the rename for files that are already
> present at the destination.
>
> [1] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L503
> [2] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L187

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
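
For readers of the archive, the fix proposed in the issue description (skip the rename when the destination already exists) can be sketched roughly as follows. This is only an illustration under stated assumptions: InMemoryFiles and finalize_renames are hypothetical names invented here, not Beam's ChannelFactory or the actual FileSink code.

```python
class InMemoryFiles:
    """Hypothetical in-memory file system used only for this sketch."""

    def __init__(self, paths):
        self.paths = set(paths)

    def exists(self, path):
        return path in self.paths

    def rename(self, src, dst):
        # Like many real file systems, fail when the source is missing.
        if src not in self.paths:
            raise IOError('source not found: %s' % src)
        self.paths.remove(src)
        self.paths.add(dst)


def finalize_renames(fs, renames):
    """Renames each (src, dst) pair, skipping destinations that already exist.

    Returns the destinations that were moved by this call.
    """
    moved = []
    for src, dst in renames:
        if fs.exists(dst):
            # Assumed to have been moved by an earlier, interrupted attempt.
            continue
        fs.rename(src, dst)
        moved.append(dst)
    return moved


fs = InMemoryFiles(['tmp/a', 'tmp/b'])
renames = [('tmp/a', 'out/a'), ('tmp/b', 'out/b')]
fs.rename('tmp/a', 'out/a')    # simulate a partially completed earlier run
finalize_renames(fs, renames)  # moves only the remaining file
finalize_renames(fs, renames)  # repeating the call is a no-op
```

Without the exists() check, the second attempt would raise on 'tmp/a', which is the failure mode the issue describes.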
[jira] [Commented] (BEAM-522) Update FileSink.finalize_write() to be idempotent
[ https://issues.apache.org/jira/browse/BEAM-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406959#comment-15406959 ]

ASF GitHub Bot commented on BEAM-522:
-------------------------------------

GitHub user chamikaramj opened a pull request:

    https://github.com/apache/incubator-beam/pull/779

    [BEAM-522] Fixes GcsIO.exists() to properly handle files that do not exist

    Currently this invocation fails for non existing files instead of returning false.

    Updates FileSink.finalize_write() so that we capture and log any transient errors that get thrown at the channel_factory.exists() call.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chamikaramj/incubator-beam sink_finalize_fix_idempotency

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/779.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #779

commit 792c3b5c79b6e979bc34bcf457f8a33cebd74daf
Author: Chamikara Jayalath
Date:   2016-08-04T01:25:41Z

    Fixes GcsIO.exists() to properly handle files that do not exist. Currently this invocation fails for non existing files instead of returning false.

    Updates FileSink.finalize_write() so that we capture and log any transient errors that get thrown at the channel_factory.exists() call.
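
As a rough illustration of the second change described in the pull request (capturing and logging errors raised by the exists() check), a rename helper might look like the sketch below. The function name rename_one, its callable parameters, and the choice to report the original rename error after logging are all assumptions made for this example; the pull request itself is the authority on what the code actually does.

```python
import logging


def rename_one(rename, exists, src, dst):
    """Tries to move src to dst.

    Returns (dst, None) on success, or (None, error) when the rename failed
    and the destination could not be confirmed to exist.
    """
    try:
        rename(src, dst)
    except IOError as rename_error:
        # The file may already have been moved by an earlier attempt.
        try:
            already_there = exists(dst)
        except Exception as exists_error:
            # Log the secondary failure and report the original one.
            logging.warning(
                'Could not check whether %s exists: %s', dst, exists_error)
            return None, rename_error
        if not already_there:
            return None, rename_error
    return dst, None
```

With this shape, a failed rename whose destination is already present counts as success, which is what makes a repeated finalize step harmless.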
[jira] [Commented] (BEAM-522) Update FileSink.finalize_write() to be idempotent
[ https://issues.apache.org/jira/browse/BEAM-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406952#comment-15406952 ]

Chamikara Jayalath commented on BEAM-522:
-----------------------------------------

Actually, the bug is in the exists() implementation of gcsio.py:
https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L237

Instead of catching IOError, we should be catching HttpError and checking the error code to see if it's 404.

With this fixed, FileSink.finalize_write() becomes properly idempotent, since we handle failures of the rename() invocation at the following location:
https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L533
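
The change suggested in the comment above (catch HttpError and treat only a 404 as "does not exist") can be sketched as follows. The HttpError class here is a hypothetical stand-in defined for the example, and get_metadata is an assumed callable that performs the lookup; neither is the actual gcsio.py code.

```python
class HttpError(Exception):
    """Hypothetical stand-in for the HTTP client's error type."""

    def __init__(self, status_code):
        super().__init__('HTTP %d' % status_code)
        self.status_code = status_code


def exists(get_metadata, path):
    """Returns True if the lookup succeeds and False if it reports 404.

    Any other HTTP error is re-raised so that transient or permission
    failures are not mistaken for a missing file.
    """
    try:
        get_metadata(path)
        return True
    except HttpError as error:
        if error.status_code == 404:
            return False
        raise
```

Re-raising non-404 errors matters here: returning False for, say, a 403 or a 500 would make a caller believe the destination is absent when that is not known.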