[GitHub] [spark] HeartSaVioR commented on a change in pull request #26920: [SPARK-30281][SS] Consider partitioned/recursive option while verifying archive path on FileStreamSource

GitBox Tue, 17 Dec 2019 15:48:19 -0800

HeartSaVioR commented on a change in pull request #26920: [SPARK-30281][SS] 
Consider partitioned/recursive option while verifying archive path on 
FileStreamSource
URL: https://github.com/apache/spark/pull/26920#discussion_r359048173


 ##########
 File path: docs/structured-streaming-programming-guide.md
 ##########
 @@ -548,7 +548,8 @@ Here are the details of all the sources in Spark.
         "s3a://a/b/c/dataset.txt"<br/>
         <code>cleanSource</code>: option to clean up completed files after processing.<br/>
         Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".<br/>
-        When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater than 2). e.g. <code>/archived/here</code>. This will ensure archived files are never included as new source files.<br/>
+        When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" should ensure some condition to guarantee archived files are never included as new source files:
 
 Review comment:
   Actually, that made me want to stick with the simple condition we have now 
(I also felt that end users may not find the rule easy to follow), though 
unfortunately we found cases where we can no longer do that.
   
   I tried to follow the reworded sentence, but it seems to cause confusion:
   
   1) `Otherwise archived files would be considered new source files.` This 
sounds to me as if violating the rule is allowed and this is the consequence, 
but the goal is that we simply don't allow the rule to be violated.
   
   2) The point of the condition is that we check for a match at the same 
depth, taking the minimum depth of the two paths, for the reason explained in 
the PR description. While we may want to skip elaborating on why, I think we 
still need to clarify it in the doc. I'm not sure that mentioning only 
prefix/subdirectory conveys the point.
   
   I'll try to add an example after the original sentence.
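
   To make point 2 concrete, here is a rough Python sketch of how I read 
"match at the same depth, taking the minimum". It is not the code in 
FileStreamSource: `fnmatchcase` only stands in for Hadoop's glob matching, and 
the function names and example paths are hypothetical.

```python
from fnmatch import fnmatchcase


def split_components(path):
    """Split an absolute path into its directory components, dropping empty parts."""
    return [part for part in path.split("/") if part]


def archive_conflicts_with_source(source_pattern, archive_dir):
    """Return True if archive_dir matches source_pattern when both paths are
    truncated to the smaller of their two depths (assumed rule, see lead-in)."""
    source_parts = split_components(source_pattern)
    archive_parts = split_components(archive_dir)
    # Compare only up to the minimum depth of the two paths.
    depth = min(len(source_parts), len(archive_parts))
    return all(
        fnmatchcase(archive_part, source_part)
        for source_part, archive_part in zip(source_parts[:depth], archive_parts[:depth])
    )


# Hypothetical source pattern used for illustration only.
pattern = "/hello?/spark/*"
for candidate in ["/hello1/spark/archive/dir", "/hello1/spark", "/archived/here"]:
    verdict = "rejected" if archive_conflicts_with_source(pattern, candidate) else "ok"
    print(candidate, "->", verdict)
```

   Under this reading, an archive directory deeper than the pattern is still 
rejected when its leading components match, and a shallower one is rejected 
when it matches the pattern's leading components. A prefix/subdirectory wording 
alone does not obviously cover both cases.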

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

