[
https://issues.apache.org/jira/browse/BEAM-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946125#comment-15946125
]
Daniel Halperin commented on BEAM-1822:
---------------------------------------
Also -- just to clarify -- this would be user-induced "silent data loss", not
a failure of Beam. But I agree that it would not be what users expect unless
they think hard about how the external services they're using work.
> Improve handling of eventually-consistent filepatterns
> ------------------------------------------------------
>
> Key: BEAM-1822
> URL: https://issues.apache.org/jira/browse/BEAM-1822
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Reporter: Eugene Kirpichov
>
> Reading from an eventually consistent filepattern (e.g. one located in a
> multi-regional Google Cloud Storage bucket) using FileBasedSource is
> dangerous, because it may silently process less data than the user expects
> if the match call does not return all files.
> We should improve our handling of this case. I'd suggest aiming to
> minimize the chance of silent data loss. Here are a couple of things we
> could do:
> - Let the user supply an expected number of files to be matched, and fail the
> pipeline if the actual number is different. For special filepatterns like
> XXX-of-YYY, we can autodetect the expected number.
> - Poll the filepattern for a while (perhaps for a period determined by the
> underlying IOChannelFactory that knows the typical eventual consistency
> convergence times of its filesystem), and either wait until it quiesces, or
> fail the pipeline if it doesn't.
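
For illustration only, the two suggestions above could be sketched roughly as
follows. This is not Beam code and does not use FileBasedSource or
IOChannelFactory: the class FilepatternChecks, its method names, and the
Supplier standing in for the match call are all hypothetical, and the
shard-name regex is just one guess at what an "XXX-of-YYY" name looks like.

```java
import java.util.HashSet;
import java.util.List;
import java.util.OptionalInt;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Beam SDK.
final class FilepatternChecks {
    // Assumed shard naming: ...NNNNN-of-MMMMM, optionally followed by an extension.
    private static final Pattern SHARD =
        Pattern.compile("(\\d+)-of-(\\d+)(\\.[^/]*)?$");

    // Suggestion 1a: autodetect the expected file count from a shard-style name.
    static OptionalInt expectedShardCount(List<String> files) {
        for (String file : files) {
            Matcher m = SHARD.matcher(file);
            if (m.find()) {
                return OptionalInt.of(Integer.parseInt(m.group(2)));
            }
        }
        return OptionalInt.empty();
    }

    // Suggestion 1b: fail loudly instead of silently reading fewer files.
    static void checkCount(List<String> files, int expected) {
        if (files.size() != expected) {
            throw new IllegalStateException(
                "Expected " + expected + " files but matched " + files.size());
        }
    }

    // Suggestion 2: re-run the match until the result is unchanged for
    // stableRounds consecutive polls, or give up after maxRounds polls.
    static List<String> matchUntilStable(
            Supplier<List<String>> match, int stableRounds, int maxRounds, long sleepMillis) {
        List<String> previous = null;
        int stable = 0;
        for (int round = 0; round < maxRounds; round++) {
            List<String> current = match.get();
            boolean same = previous != null
                && new HashSet<>(current).equals(new HashSet<>(previous));
            stable = same ? stable + 1 : 0;
            if (stable >= stableRounds) {
                return current;
            }
            previous = current;
            if (round < maxRounds - 1) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while polling", e);
                }
            }
        }
        throw new IllegalStateException(
            "Filepattern did not quiesce after " + maxRounds + " polls");
    }
}
```

A caller would pass its own match function as the Supplier, then run
checkCount on the result whenever expectedShardCount (or the user) supplies a
number. Note that a stable result is only evidence of convergence, not proof,
so the count check remains the stronger guard of the two.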
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)