[ 
https://issues.apache.org/jira/browse/BEAM-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946125#comment-15946125
 ] 

Daniel Halperin commented on BEAM-1822:
---------------------------------------

Also -- just to clarify -- this would be user-induced "silent data loss", not 
a failure of Beam. But I agree that it would not be what users expect unless 
they think hard about how the external services they're using work.

> Improve handling of eventually-consistent filepatterns
> ------------------------------------------------------
>
>                 Key: BEAM-1822
>                 URL: https://issues.apache.org/jira/browse/BEAM-1822
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>
> Reading from an eventually consistent filepattern (e.g. one located in a 
> multi-regional Google Cloud Storage bucket) using FileBasedSource is 
> dangerous, because it may silently process less data than the user expects 
> if not all files are returned by the match call.
> We should improve our handling of this case. I'd suggest aiming to 
> minimize the chance of silent data loss. Here are a couple of things we could 
> do.
> - Let the user supply an expected number of files to be matched, and fail the 
> pipeline if the actual number is different. For special filepatterns like 
> XXX-of-YYY, we can autodetect the expected number.
> - Poll the filepattern for a while (perhaps for a period determined by the 
> underlying IOChannelFactory that knows the typical eventual consistency 
> convergence times of its filesystem), and either wait until it quiesces, or 
> fail the pipeline if it doesn't.
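
For illustration only, the two suggestions in the description could be sketched 
roughly as below. This is plain Java, not Beam code: the class 
FilepatternChecks, its methods, and the Supplier standing in for the match 
call are all hypothetical names, and the "-NNNNN-of-NNNNN" regex is an assumed 
reading of the XXX-of-YYY convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.TreeSet;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Beam.
final class FilepatternChecks {
    // Assumed shape of sharded names, e.g. "out-00003-of-00010".
    private static final Pattern SHARD = Pattern.compile("-(\\d+)-of-(\\d+)");

    /** Infers the expected file count from the first name that looks sharded. */
    static OptionalInt inferExpectedCount(List<String> files) {
        for (String file : files) {
            Matcher m = SHARD.matcher(file);
            if (m.find()) {
                return OptionalInt.of(Integer.parseInt(m.group(2)));
            }
        }
        return OptionalInt.empty();
    }

    /** Fails loudly instead of silently reading fewer files than expected. */
    static void checkCount(List<String> files, int expected) {
        if (files.size() != expected) {
            throw new IllegalStateException(
                "Expected " + expected + " files but matched " + files.size());
        }
    }

    /**
     * Polls the (hypothetical) match call until the result has repeated the
     * previous result stableRounds times in a row, or fails after maxAttempts.
     */
    static List<String> matchUntilQuiescent(
            Supplier<List<String>> match, int stableRounds,
            int maxAttempts, long sleepMillis) {
        TreeSet<String> previous = null;
        int stable = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Sort so that listing order does not affect the comparison.
            TreeSet<String> current = new TreeSet<>(match.get());
            if (current.equals(previous)) {
                stable++;
                if (stable >= stableRounds) {
                    return new ArrayList<>(current);
                }
            } else {
                stable = 0;
            }
            previous = current;
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
        throw new IllegalStateException(
            "Filepattern did not quiesce after " + maxAttempts + " attempts");
    }
}
```

A caller might poll first, then apply checkCount with either a user-supplied 
number or the value from inferExpectedCount. Note that a stable listing is 
only evidence of convergence, not proof, so the count check remains the 
stronger guard where an expected number is available.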



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
