[jira] [Commented] (BEAM-522) Update FileSink.finalize_write() to be idempotent

2016-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415985#comment-15415985
 ] 

ASF GitHub Bot commented on BEAM-522:
-

GitHub user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/779


> Update FileSink.finalize_write() to be idempotent
> -
>
> Key: BEAM-522
> URL: https://issues.apache.org/jira/browse/BEAM-522
> Project: Beam
> Issue Type: Bug
> Components: sdk-py
> Reporter: Chamikara Jayalath
> Assignee: Chamikara Jayalath
>
> Currently, FileSink.finalize_write() in fileio.py [1] performs the following
> operations:
> (1) Obtains a list of temporary files as a side input.
> (2) Renames each temporary file to the location where the final output should
> be stored.
> The iobase.Sink.finalize_write() operation should be idempotent, since runner
> implementations may call it multiple times due to task failures.
> The current implementation is not idempotent: if the operation is re-run
> after only a subset of the files has been renamed, it may fail because some
> files can no longer be found at the source location (for example, [2] for
> GCS files).
> We can fix this by checking whether the destination file already exists
> before performing the rename, and skipping the rename for files that are
> already at the destination.
> [1] 
> https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L503
> [2] 
> https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L187
>  
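
The fix proposed in the description above can be sketched roughly as follows. This is a minimal illustration on the local filesystem, not the actual FileSink code: finalize_renames and its argument are hypothetical names, and os.path.exists()/os.rename() merely stand in for whatever the sink uses to check and move files.

```python
import os
import tempfile


def finalize_renames(renames):
    """Rename each (src, dst) pair, skipping pairs already finalized.

    Hypothetical helper for illustration only. Returns the list of
    destinations that were renamed during this call.
    """
    renamed = []
    for src, dst in renames:
        if os.path.exists(dst):
            # An earlier attempt already moved this file; do nothing.
            continue
        os.rename(src, dst)
        renamed.append(dst)
    return renamed


# Simulate a finalize step that runs twice, as after a task retry.
workdir = tempfile.mkdtemp()
pairs = []
for i in range(3):
    src = os.path.join(workdir, "temp-%d" % i)
    dst = os.path.join(workdir, "output-%d" % i)
    with open(src, "w") as f:
        f.write("shard %d\n" % i)
    pairs.append((src, dst))

first_run = finalize_renames(pairs)
second_run = finalize_renames(pairs)  # No error: every destination exists.
```

Because the second call finds every destination in place, it performs no renames and does not fail on the missing source files, which is the idempotency property the issue asks for.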



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-522) Update FileSink.finalize_write() to be idempotent

2016-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406959#comment-15406959
 ] 

ASF GitHub Bot commented on BEAM-522:
-

GitHub user chamikaramj opened a pull request:

https://github.com/apache/incubator-beam/pull/779

[BEAM-522] Fixes GcsIO.exists() to properly handle files that do not exist

Currently, this call fails for nonexistent files instead of returning
False.

Updates FileSink.finalize_write() so that we capture and log any transient
errors thrown by the channel_factory.exists() call.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chamikaramj/incubator-beam sink_finalize_fix_idempotency

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/779.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #779


commit 792c3b5c79b6e979bc34bcf457f8a33cebd74daf
Author: Chamikara Jayalath 
Date:   2016-08-04T01:25:41Z

Fixes GcsIO.exists() to properly handle files that do not exist.

Currently, this call fails for nonexistent files instead of returning
False.

Updates FileSink.finalize_write() so that we capture and log any transient
errors thrown by the channel_factory.exists() call.






[jira] [Commented] (BEAM-522) Update FileSink.finalize_write() to be idempotent

2016-08-03 Thread Chamikara Jayalath (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406952#comment-15406952
 ] 

Chamikara Jayalath commented on BEAM-522:
-

Actually, the bug is in the exists() implementation in gcsio.py:
https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L237

Instead of catching IOError, we should catch HttpError and check whether the
error code is 404.

With this fixed, FileSink.finalize_write() becomes properly idempotent, since we
handle failures of the rename() invocation at the following location:
https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L533
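
The change described above might look roughly like the sketch below. Everything here is an assumption made for illustration: HttpError is a local stand-in class rather than the real client library exception, and FakeStorageClient and get_metadata() are hypothetical names, not the gcsio.py API.

```python
class HttpError(Exception):
    """Stand-in for an HTTP client error; not the real library class."""

    def __init__(self, status_code):
        super().__init__("HTTP %d" % status_code)
        self.status_code = status_code


class FakeStorageClient:
    """Hypothetical in-memory client: raises HttpError(404) if missing."""

    def __init__(self, objects, failing=()):
        self._objects = set(objects)
        self._failing = set(failing)

    def get_metadata(self, path):
        if path in self._failing:
            raise HttpError(503)  # Simulated transient server error.
        if path not in self._objects:
            raise HttpError(404)
        return {"name": path}


def exists(client, path):
    """Return True if the object exists, False only on a 404."""
    try:
        client.get_metadata(path)
        return True
    except HttpError as err:
        if err.status_code == 404:
            return False
        raise  # Other errors (e.g. 503) are not evidence of absence.


client = FakeStorageClient(
    objects={"gs://bucket/output-0"}, failing={"gs://bucket/flaky"})
found = exists(client, "gs://bucket/output-0")
missing = exists(client, "gs://bucket/temp-0")
```

Only a 404 is treated as "does not exist"; any other error is re-raised so that the caller can log it or retry instead of mistaking a transient failure for a missing file.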
