Github user HeartSaVioR commented on a diff in the pull request:
https://github.com/apache/spark/pull/22952#discussion_r231717554
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -530,6 +530,8 @@ Here are the details of all the sources in Spark.
"s3://a/dataset.txt"<br/>
"s3n://a/b/dataset.txt"<br/>
"s3a://a/b/c/dataset.txt"<br/>
+ <br/>
+ <code>renameCompletedFiles</code>: whether to rename files completed
in the previous batch (default: false). If the option is enabled, input files
will be renamed with the additional suffix "_COMPLETED_". This is useful for cleaning
up old input files to save storage space.
--- End diff ---
Totally agreed, and that matches option 3 as I've proposed it. Also, option 1
would not affect the critical path of a batch much, since rename operations would
be enqueued and a background thread would take care of them.
For option 1, providing a guarantee is what complicates things. If we are OK
with NOT guaranteeing that all processed files are renamed, we can do the renaming
in the background (as in option 1) without handling backpressure, and simply drop
requests from the queue (with logging) if the queue size exceeds a threshold or the JVM
is shutting down.
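A minimal sketch of the best-effort background rename described above, assuming a bounded in-memory queue whose `submit` drops (and logs) when full. All names here (`RenameWorker`, `COMPLETED_SUFFIX`, the capacity) are illustrative, not from the PR; a real Spark implementation would go through the Hadoop `FileSystem` API rather than `java.nio.file`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative best-effort renamer: a bounded queue drained by one background
// thread. No backpressure; overflow requests are dropped with a log line.
class RenameWorker implements Runnable {
    static final String COMPLETED_SUFFIX = "_COMPLETED_";
    private final BlockingQueue<Path> queue;

    RenameWorker(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Enqueue a rename request; returns false (after logging) if dropped. */
    boolean submit(Path file) {
        boolean accepted = queue.offer(file);  // non-blocking, so no backpressure
        if (!accepted) {
            System.err.println("Rename queue full; dropping " + file);
        }
        return accepted;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Path src;
            try {
                src = queue.take();
            } catch (InterruptedException e) {
                // JVM shutting down or worker stopped: remaining requests are dropped
                Thread.currentThread().interrupt();
                return;
            }
            try {
                Path dst = src.resolveSibling(src.getFileName() + COMPLETED_SUFFIX);
                Files.move(src, dst);
            } catch (IOException e) {
                // Best effort: log and keep draining the queue
                System.err.println("Rename failed for " + src + ": " + e);
            }
        }
    }
}
```

Because `submit` uses `offer` rather than `put`, the caller on the batch's critical path never blocks; the trade-off is exactly the relaxed guarantee discussed above, that some processed files may never be renamed.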
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]