vanzin commented on a change in pull request #26502: [SPARK-29876][SS] Delete/archive file source completed files in separate thread
URL: https://github.com/apache/spark/pull/26502#discussion_r353497464
 
 

 ##########
 File path: docs/structured-streaming-programming-guide.md
 ##########
 @@ -550,9 +550,10 @@ Here are the details of all the sources in Spark.
         Available options are "archive", "delete", "off". If the option is not 
provided, the default value is "off".<br/>
         When "archive" is provided, additional option 
<code>sourceArchiveDir</code> must be provided as well. The value of 
"sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater 
than 2). e.g. <code>/archived/here</code>. This will ensure archived files are 
never included as new source files.<br/>
         Spark will move source files respecting their own path. For example, 
if the path of source file is <code>/a/b/dataset.txt</code> and the path of 
archive directory is <code>/archived/here</code>, file will be moved to 
<code>/archived/here/a/b/dataset.txt</code>.<br/>
-        NOTE: Both archiving (via moving) or deleting completed files will 
introduce overhead (slow down) in each micro-batch, so you need to understand 
the cost for each operation in your file system before enabling this option. On 
the other hand, enabling this option will reduce the cost to list source files 
which can be an expensive operation.<br/>
+        NOTE: Both archiving (via moving) or deleting completed files will 
introduce overhead (slow down, even if it's happening in separate thread) in 
each micro-batch, so you need to understand the cost for each operation in your 
file system before enabling this option. On the other hand, enabling this 
option will reduce the cost to list source files which can be an expensive 
operation.<br/>
+        Number of threads used in completed file cleaner can be configured 
with<code>spark.sql.streaming.fileSource.cleaner.numThreads</code>.<br/>
 
 Review comment:
   This should also mention the default value of <code>spark.sql.streaming.fileSource.cleaner.numThreads</code> here.
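
   For readers of the surrounding doc text, the archive-path rule and the "archive requires sourceArchiveDir" rule can be sketched in plain Python. This is only an illustration of what the quoted documentation describes, not Spark's implementation: `archived_path` and `check_clean_source_options` are hypothetical helper names, and the option keys `cleanSource` and `sourceArchiveDir` are taken from the doc text above.

```python
from pathlib import PurePosixPath


def archived_path(source_file: str, archive_dir: str) -> str:
    """Return where source_file would land under archive_dir, keeping its own path.

    Hypothetical helper mirroring the documented example only.
    """
    src = PurePosixPath(source_file)
    if not src.is_absolute():
        raise ValueError("source_file must be an absolute path")
    # Drop the leading "/" so the source path nests under the archive directory.
    return str(PurePosixPath(archive_dir) / src.relative_to("/"))


def check_clean_source_options(options: dict) -> str:
    """Validate the documented option combination and return the effective mode."""
    mode = options.get("cleanSource", "off")  # documented default is "off"
    if mode not in ("archive", "delete", "off"):
        raise ValueError(f"unsupported cleanSource value: {mode!r}")
    if mode == "archive" and not options.get("sourceArchiveDir"):
        raise ValueError('"archive" requires sourceArchiveDir to be set')
    return mode


if __name__ == "__main__":
    opts = {"cleanSource": "archive", "sourceArchiveDir": "/archived/here"}
    check_clean_source_options(opts)
    print(archived_path("/a/b/dataset.txt", opts["sourceArchiveDir"]))
```

   With the paths from the doc example, the sketch yields `/archived/here/a/b/dataset.txt`, matching the documented behavior. It does not model the directory-depth requirement on the archive directory or the cleaner thread pool.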

