vanzin commented on a change in pull request #26920: [SPARK-30281][SS] Consider 
partitioned/recursive option while verifying archive path on FileStreamSource
URL: https://github.com/apache/spark/pull/26920#discussion_r358999809
 
 

 ##########
 File path: docs/structured-streaming-programming-guide.md
 ##########
 @@ -548,7 +548,8 @@ Here are the details of all the sources in Spark.
         "s3a://a/b/c/dataset.txt"<br/>
         <code>cleanSource</code>: option to clean up completed files after 
processing.<br/>
         Available options are "archive", "delete", "off". If the option is not 
provided, the default value is "off".<br/>
-        When "archive" is provided, additional option 
<code>sourceArchiveDir</code> must be provided as well. The value of 
"sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater 
than 2). e.g. <code>/archived/here</code>. This will ensure archived files are 
never included as new source files.<br/>
+        When "archive" is provided, additional option 
<code>sourceArchiveDir</code> must be provided as well. The value of 
"sourceArchiveDir" should ensure some condition to guarantee archived files are 
never included as new source files:
 
 Review comment:
   Code looks ok but the documentation is kinda hard to follow.
   
   First, the whole "should ensure some condition" part is redundant since 
there is a single condition. Just replace it with the sentence that follows it 
in the doc.
   
   That sentence can be reworded a bit to be clearer, too:
   
   ```
   The value of <code>sourceArchiveDir</code> must not match the source 
pattern, when considering just the prefix of the paths that match in 
subdirectory depth. Otherwise archived files would be considered new source 
files.
   ```
   
   (It's kinda hard to explain the depth thing with words in documentation. It 
always sounds a bit confusing. An example would be much clearer.)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
