AnandInguva commented on code in PR #28206:
URL: https://github.com/apache/beam/pull/28206#discussion_r1309360401


##########
sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java:
##########
@@ -509,12 +509,22 @@ public MatchConfiguration withEmptyMatchTreatment(EmptyMatchTreatment treatment)
      * condition is reached, where the input to the condition is the filepattern.
      *
      * <p>If {@code matchUpdatedFiles} is set, also watches for files with timestamp change, with
-     * the watching frequency given by the {@code interval}. The pipeline will throw a {@code
-     * RuntimeError} if timestamp extraction for the matched file has failed, suggesting the
+     * the watching frequency given by the {@code interval}. The pipeline will throw a
+     * {@code RuntimeError} if timestamp extraction for the matched file has failed, suggesting the
      * timestamp metadata is not available with the IO connector.
+     *
+     * <p>
+     * Matching continuously scales poorly, as it is stateful, and requires storing file ids in
+     * memory. In addition, because it is memory-only, if a pipeline is restarted, already processed
+     * files will be reprocessed. Consider an alternate technique, such as
+     * <a href="https://cloud.google.com/storage/docs/pubsub-notifications">Pub/Sub Notifications</a>
+     * when using GCS if possible.
+     * </p>
      */
     public MatchConfiguration continuously(
         Duration interval, TerminationCondition<String, ?> condition, boolean matchUpdatedFiles) {
+      LOG.warn("Matching Continuously is stateful, and can scale poorly. Consider using Pub/Sub "

Review Comment:
   I think this is also true for the Python SDK. If so, can we add a warning there?
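
   For illustration only, a minimal sketch of what such a warning might look like on the Python side. The helper name `warn_continuous_match` and the message text are hypothetical (the message is loosely adapted from the Java warning in this diff, not copied from the Python SDK), and where the call would live in the Python SDK is left open.

```python
import logging

_LOGGER = logging.getLogger(__name__)

# Hypothetical message text, loosely adapted from the Java warning in this PR.
CONTINUOUS_MATCH_WARNING = (
    "Matching continuously is stateful and can scale poorly. "
    "Consider an alternative such as Pub/Sub notifications where possible.")


def warn_continuous_match(logger=_LOGGER):
    """Hypothetical helper: log the scaling warning and return the message.

    Intended to be called once, when a continuous-matching transform is
    constructed, so the warning appears at pipeline construction time.
    """
    logger.warning(CONTINUOUS_MATCH_WARNING)
    return CONTINUOUS_MATCH_WARNING
```

   Logging at construction time rather than per element would mirror the Java change, which warns inside `continuously(...)`.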



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
