johnjcasey commented on code in PR #28206:
URL: https://github.com/apache/beam/pull/28206#discussion_r1310757661
##########
sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java:
##########
@@ -509,12 +509,22 @@ public MatchConfiguration withEmptyMatchTreatment(EmptyMatchTreatment treatment)
* condition is reached, where the input to the condition is the filepattern.
*
* <p>If {@code matchUpdatedFiles} is set, also watches for files with timestamp change, with
- * the watching frequency given by the {@code interval}. The pipeline will throw a {@code
- * RuntimeError} if timestamp extraction for the matched file has failed, suggesting the
+ * the watching frequency given by the {@code interval}. The pipeline will throw a
+ * {@code RuntimeError} if timestamp extraction for the matched file has failed, suggesting the
* timestamp metadata is not available with the IO connector.
+ *
+ * <p>
+ * Matching continuously scales poorly, as it is stateful, and requires storing file ids in
+ * memory. In addition, because it is memory-only, if a pipeline is restarted, already processed
+ * files will be reprocessed. Consider an alternate technique, such as
+ * <a href="https://cloud.google.com/storage/docs/pubsub-notifications">Pub/Sub Notifications</a>
+ * when using GCS if possible.
+ * </p>
*/
public MatchConfiguration continuously(
Duration interval, TerminationCondition<String, ?> condition, boolean matchUpdatedFiles) {
+ LOG.warn("Matching Continuously is stateful, and can scale poorly. Consider using Pub/Sub "
Review Comment:
will do