[GitHub] [spark] HeartSaVioR commented on a change in pull request #28422: [SPARK-17604][SS] FileStreamSource: provide a new option to have retention on input files

GitBox Fri, 01 May 2020 07:42:15 -0700


HeartSaVioR commented on a change in pull request #28422:
URL: https://github.com/apache/spark/pull/28422#discussion_r418571213




##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -542,6 +542,12 @@ Here are the details of all the sources in Spark.
         <br/>
         <code>maxFileAge</code>: Maximum age of a file that can be found in 
this directory, before it is ignored. For the first batch all files will be 
considered valid. If <code>latestFirst</code> is set to `true` and 
<code>maxFilesPerTrigger</code> is set, then this parameter will be ignored, 
because old files that are valid, and should be processed, may be ignored. The 
max age is specified with respect to the timestamp of the latest file, and not 
the timestamp of the current system.(default: 1 week)
         <br/>
+        <code>inputRetention</code>: Maximum age of a file that can be found 
in this directory, before it is ignored.<br/>
+        This is the "hard" limit of input data retention - input files older 
than the max age will be ignored regardless of source options (while 
`maxFileAgeMs` depends on the condition), as well as entries in checkpoint 
metadata will be purged based on this.<br/>
+        Unlike `maxFileAgeMs`, the max age is specified with respect to the 
timestamp of the current system, to provide consistent behavior regardless of 
metadata entries.<br/>
+        NOTE 1: Please be careful to set the value if the query replays from 
the old input files.<br/>
+        NOTE 2: Please make sure the timestamp is in sync between nodes which 
run the query.<br/>
+        <br/>

Review comment:
       It looks like the kind of value isn't specified for many options, but is 
implied by their default values. This option doesn't have a default value, so it 
may be better to specify the kind of value explicitly. Good point!
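
For illustration only, the two reference points described in the diff can be contrasted with a minimal sketch. This is plain Python, not Spark code: the function names, the millisecond timestamps, and the inclusive boundary (`<=`) are assumptions made for the example and are not taken from the pull request.

```python
from typing import List


def filter_by_max_file_age(file_timestamps_ms: List[int], max_age_ms: int) -> List[int]:
    """Keep files whose age, measured from the NEWEST file's timestamp,
    does not exceed max_age_ms (hypothetical model of `maxFileAge`)."""
    if not file_timestamps_ms:
        return []
    latest = max(file_timestamps_ms)
    return [ts for ts in file_timestamps_ms if latest - ts <= max_age_ms]


def filter_by_input_retention(
    file_timestamps_ms: List[int], retention_ms: int, now_ms: int
) -> List[int]:
    """Keep files whose age, measured from the CURRENT system time,
    does not exceed retention_ms (hypothetical model of `inputRetention`)."""
    return [ts for ts in file_timestamps_ms if now_ms - ts <= retention_ms]


if __name__ == "__main__":
    files = [1_000, 9_000, 10_000]
    # Age relative to the latest file (10_000): the file at 9_000 is kept.
    print(filter_by_max_file_age(files, 2_000))
    # Age relative to "now" (12_000): the file at 9_000 is already too old.
    print(filter_by_input_retention(files, 2_000, now_ms=12_000))
```

With the same limit, the second check drops more files as soon as the system clock runs ahead of the newest file, which is why the diff's notes about replaying old input and keeping clocks in sync matter.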






