Eugene Kirpichov created BEAM-1822:
--------------------------------------
Summary: Improve handling of eventually-consistent filepatterns
Key: BEAM-1822
URL: https://issues.apache.org/jira/browse/BEAM-1822
Project: Beam
Issue Type: Bug
Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Daniel Halperin
Reading from an eventually consistent filepattern (e.g. one located in a
multi-regional Google Cloud Storage bucket) using FileBasedSource is
dangerous, because it may silently process less data than the user expects if
the match call does not return all of the files.
We should improve our handling of this case. I'd suggest aiming to minimize
the chance of silent data loss. Here are a couple of things we could do.
- Let the user supply an expected number of files to be matched, and fail the
pipeline if the actual number is different. For special filepatterns like
XXX-of-YYY, we can autodetect the expected number.
- Poll the filepattern for a while (perhaps for a period determined by the
underlying IOChannelFactory that knows the typical eventual consistency
convergence times of its filesystem), and either wait until it quiesces, or
fail the pipeline if it doesn't.
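
To make the two suggestions above concrete, here is a minimal sketch in plain Java. It is not Beam code: the class FilepatternChecks, its method names, and the maxPolls / stableRounds / sleepMillis parameters are all hypothetical, and the match call is stood in for by a Supplier. It assumes shard names of the form -XXX-of-YYY and treats "quiesced" as "the same set of files was returned on several consecutive polls".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the Beam SDK. */
public class FilepatternChecks {
  // Assumed shard suffix shape, e.g. "-00003-of-00010".
  private static final Pattern SHARD = Pattern.compile("-(\\d+)-of-(\\d+)");

  /**
   * Returns the shard total YYY if every name has an XXX-of-YYY suffix and
   * all names agree on YYY; otherwise returns empty (no autodetection).
   */
  public static OptionalInt expectedShardCount(List<String> names) {
    OptionalInt total = OptionalInt.empty();
    for (String name : names) {
      Matcher m = SHARD.matcher(name);
      if (!m.find()) {
        return OptionalInt.empty();
      }
      int t = Integer.parseInt(m.group(2));
      if (total.isPresent() && total.getAsInt() != t) {
        return OptionalInt.empty();
      }
      total = OptionalInt.of(t);
    }
    return total;
  }

  /** Fails loudly instead of silently processing a partial match. */
  public static void checkMatchCount(List<String> names, int expected) {
    if (names.size() != expected) {
      throw new IllegalStateException(
          "Expected " + expected + " files but matched " + names.size());
    }
  }

  /**
   * Polls the match call until it returns the same set of files on
   * stableRounds consecutive polls, or fails after maxPolls attempts.
   */
  public static List<String> pollUntilQuiescent(
      Supplier<List<String>> match, int maxPolls, int stableRounds, long sleepMillis)
      throws InterruptedException {
    Set<String> previous = null;
    int stable = 0;
    for (int i = 0; i < maxPolls; i++) {
      Set<String> current = new TreeSet<>(match.get());
      // Count how many consecutive polls repeated the previous result.
      stable = current.equals(previous) ? stable + 1 : 0;
      if (stable >= stableRounds) {
        return new ArrayList<>(current);
      }
      previous = current;
      Thread.sleep(sleepMillis);
    }
    throw new IllegalStateException(
        "Filepattern did not quiesce after " + maxPolls + " polls");
  }
}
```

In a real implementation the polling period would presumably come from the filesystem-specific IOChannelFactory, as suggested above, rather than from a caller-supplied constant. The two checks also compose: poll first, then compare the quiesced result against the user-supplied or autodetected count.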