Eugene Kirpichov created BEAM-1822:
--------------------------------------

             Summary: Improve handling of eventually-consistent filepatterns
                 Key: BEAM-1822
                 URL: https://issues.apache.org/jira/browse/BEAM-1822
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-core
            Reporter: Eugene Kirpichov
            Assignee: Daniel Halperin


Reading from an eventually consistent filepattern (e.g. one located in a
multi-regional Google Cloud Storage bucket) using FileBasedSource is
dangerous: if the match call does not return all files, the pipeline may
silently process less data than the user expects.

We should improve our handling of this case. I'd suggest aiming to minimize
the chance of silent data loss. Here are a couple of things we could do.

- Let the user supply an expected number of files to be matched, and fail the 
pipeline if the actual number is different. For special filepatterns like 
XXX-of-YYY, we can autodetect the expected number.
- Poll the filepattern for a while (perhaps for a period determined by the 
underlying IOChannelFactory that knows the typical eventual consistency 
convergence times of its filesystem), and either wait until it quiesces, or 
fail the pipeline if it doesn't.
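
To make the two suggestions above concrete, here is a rough sketch in plain
Java. It is not Beam API: the class FilepatternChecks, its method names, and
the Supplier standing in for the match call are all hypothetical, and the
retry counts and sleep interval are placeholders for whatever the underlying
filesystem would recommend.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.OptionalInt;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Beam; illustrates the two proposed checks.
final class FilepatternChecks {
    // Matches a trailing shard suffix such as "00003-of-00010", optionally
    // followed by a file extension.
    private static final Pattern SHARD =
        Pattern.compile("(\\d+)-of-(\\d+)(\\.[^/]*)?$");

    // Returns YYY from XXX-of-YYY if every file has a shard suffix and all
    // files agree on YYY; otherwise returns empty (no autodetection).
    static OptionalInt expectedShardCount(List<String> files) {
        OptionalInt expected = OptionalInt.empty();
        for (String file : files) {
            Matcher m = SHARD.matcher(file);
            if (!m.find()) {
                return OptionalInt.empty();
            }
            int total = Integer.parseInt(m.group(2));
            if (expected.isPresent() && expected.getAsInt() != total) {
                return OptionalInt.empty();
            }
            expected = OptionalInt.of(total);
        }
        return expected;
    }

    // Fails loudly instead of silently processing a partial match.
    static void checkMatchCount(List<String> files, int expected) {
        if (files.size() != expected) {
            throw new IllegalStateException(
                "Expected " + expected + " files but matched " + files.size());
        }
    }

    // Polls the (assumed) match call until the sorted result is unchanged for
    // stableRounds consecutive polls, or fails after maxRounds polls.
    static List<String> matchUntilStable(Supplier<List<String>> matcher,
            int stableRounds, int maxRounds, long sleepMillis) {
        List<String> previous = null;
        int stable = 0;
        for (int round = 0; round < maxRounds; round++) {
            List<String> current = new ArrayList<>(matcher.get());
            Collections.sort(current);
            stable = current.equals(previous) ? stable + 1 : 0;
            if (stable >= stableRounds) {
                return current;
            }
            previous = current;
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
        throw new IllegalStateException(
            "Filepattern did not quiesce after " + maxRounds + " polls");
    }
}
```

A caller could combine the checks: poll with matchUntilStable, then, if
expectedShardCount returns a value (or the user supplied one), pass it to
checkMatchCount. Note that a stable result is only evidence of convergence,
not proof, which is why the explicit count check is still useful.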



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
