[ https://issues.apache.org/jira/browse/BEAM-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Halperin reassigned BEAM-1822:
-------------------------------------
Assignee: (was: Daniel Halperin)
> Improve handling of eventually-consistent filepatterns
> ------------------------------------------------------
>
> Key: BEAM-1822
> URL: https://issues.apache.org/jira/browse/BEAM-1822
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Reporter: Eugene Kirpichov
>
> Reading from an eventually consistent filepattern (e.g. one located in a
> multi-regional Google Cloud Storage bucket) using FileBasedSource is
> dangerous, because it may silently process less data than the user expects
> if the match call does not return all files.
> We should improve our handling of this case. I'd suggest aiming to minimize
> the chance of silent data loss. Here are a couple of things we could do:
> - Let the user supply an expected number of files to be matched, and fail the
> pipeline if the actual number is different. For special filepatterns like
> XXX-of-YYY, we can autodetect the expected number.
> - Poll the filepattern for a while (perhaps for a period determined by the
> underlying IOChannelFactory that knows the typical eventual consistency
> convergence times of its filesystem), and either wait until it quiesces, or
> fail the pipeline if it doesn't.
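
For illustration only, the two suggestions above could look roughly like the
sketch below. It uses plain Java and does not touch any Beam API: the class
FilePatternChecks, its method names, and the Supplier standing in for the
match call are all hypothetical, and the assumed shard naming is a file name
ending in NNNNN-of-MMMMM with an optional extension.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Beam.
class FilePatternChecks {
    // Assumed shard naming: ...NNNNN-of-MMMMM, optionally followed by an extension.
    private static final Pattern SHARD =
        Pattern.compile("(\\d+)-of-(\\d+)(\\.[^/]*)?$");

    // Returns the shard total if every name is a shard name and all totals agree.
    static Optional<Integer> expectedShardCount(List<String> matched) {
        Integer expected = null;
        for (String name : matched) {
            Matcher m = SHARD.matcher(name);
            if (!m.find()) {
                return Optional.empty();
            }
            int total = Integer.parseInt(m.group(2));
            if (expected != null && expected != total) {
                return Optional.empty();
            }
            expected = total;
        }
        return Optional.ofNullable(expected);
    }

    // Fails if the number of matched files differs from the user-supplied
    // count or, when none is supplied, from the autodetected shard total.
    static void checkComplete(List<String> matched, Optional<Integer> userExpected) {
        Optional<Integer> expected =
            userExpected.isPresent() ? userExpected : expectedShardCount(matched);
        if (expected.isPresent() && expected.get() != matched.size()) {
            throw new IllegalStateException(
                "Expected " + expected.get() + " files but matched " + matched.size());
        }
    }

    // Polls the match call until the result repeats for stableRounds
    // consecutive polls, or fails after maxRounds polls.
    static List<String> pollUntilStable(
            Supplier<List<String>> match, int stableRounds, int maxRounds, long sleepMillis) {
        List<String> previous = null;
        int unchanged = 0;
        for (int i = 0; i < maxRounds; i++) {
            List<String> current = new ArrayList<>(match.get());
            Collections.sort(current);
            unchanged = current.equals(previous) ? unchanged + 1 : 0;
            if (unchanged >= stableRounds) {
                return current;
            }
            previous = current;
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
        throw new IllegalStateException("Filepattern did not quiesce");
    }
}
```

A caller would first run pollUntilStable with a poll interval chosen for the
filesystem in question, then pass the result to checkComplete. Note that a
stable result is not proof of completeness, which is why the count check is
still worth running afterwards.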
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)