HeartSaVioR commented on issue #22952: [SPARK-20568][SS] Provide option to 
clean up completed files in streaming query
URL: https://github.com/apache/spark/pull/22952#issuecomment-454645347
 
 
   @zsxwing 
   
   > I feel the current behavior of rename is a bit weird. Let's say the source 
dir is `/a/b/c` and sourceArchiveDir is `/a/b/c/_archive`, if the file to 
rename is `/a/b/c/d=1/x.txt`, the final path will be 
`/a/b/c/_archive/a/b/c/d=1/x.txt`. Is there a special reason?
   
   I considered a couple of cases, including the simple approach, but things 
get complicated once wildcards are involved, especially wildcards in the 
middle of the path. Suppose the source path is given as `/a/b*/c`; the files 
to be consumed could then be `/a/b/c/d.txt`, `/a/b1/c/d.txt`, 
`/a/ba/c/d.txt`, etc. We need to make sure these files do not conflict with 
each other when archived, so their final archive paths must all be distinct. 
Hence we can't use `/c/d.txt` or `/d.txt` for these files.
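
   To make the conflict concrete, here is a minimal sketch in plain Python 
(not Spark code). The helper names `keep_last_components` and 
`find_collisions` are hypothetical and exist only for illustration. It keeps 
the last `n` components of each matched file and reports which shortened 
paths are shared by more than one source file.

```python
from collections import defaultdict

def keep_last_components(path, n):
    """Keep only the last n components of an absolute path (hypothetical helper)."""
    parts = path.strip("/").split("/")
    return "/" + "/".join(parts[-n:])

def find_collisions(paths, n):
    """Map each shortened path to its source files; return only the shared ones."""
    groups = defaultdict(list)
    for p in paths:
        groups[keep_last_components(p, n)].append(p)
    return {short: srcs for short, srcs in groups.items() if len(srcs) > 1}

# Files that the glob /a/b*/c could match, as in the example above.
matched = ["/a/b/c/d.txt", "/a/b1/c/d.txt", "/a/ba/c/d.txt"]

for n in (1, 2, 3):
    print(n, find_collisions(matched, n))
```

   With `n = 1` or `n = 2` all three files map to the same shortened path; 
only from `n = 3` onward, where the component matched by the wildcard is 
kept, do the paths stay distinct.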
   
   The part of the path we can safely drop is the longest leading sub-path 
that contains no glob pattern. In the case above that is `/a`, so the 
shortest paths we could use for these files are `/b/c/d.txt`, 
`/b1/c/d.txt`, and `/ba/c/d.txt`. But I feel that's not intuitive, and end 
users would have to work out for themselves how the path gets modified.
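
   The prefix-stripping alternative could be sketched as follows. This is 
plain Python for illustration, not what the PR implements; `GLOB_CHARS`, 
`glob_free_prefix`, and `strip_prefix` are hypothetical names, and the set of 
characters treated as glob syntax is an assumption rather than the exact 
rules of the underlying file system.

```python
# Assumption: these characters mark a path component as a glob pattern.
GLOB_CHARS = set("*?[]{}")

def glob_free_prefix(pattern):
    """Return the longest leading sub-path of `pattern` with no glob characters."""
    prefix = []
    for part in pattern.strip("/").split("/"):
        if any(ch in GLOB_CHARS for ch in part):
            break
        prefix.append(part)
    return "/" + "/".join(prefix)

def strip_prefix(path, prefix):
    """Drop `prefix` from the front of `path`, keeping the leading slash."""
    if prefix == "/":
        return path
    if not path.startswith(prefix + "/"):
        raise ValueError(f"{path} is not under {prefix}")
    return path[len(prefix):]

matched = ["/a/b/c/d.txt", "/a/b1/c/d.txt", "/a/ba/c/d.txt"]
prefix = glob_free_prefix("/a/b*/c")
shortened = [strip_prefix(p, prefix) for p in matched]
print(prefix, shortened)
```

   The result depends on where the first wildcard sits in the source 
pattern, which is exactly what a user would have to work out to predict the 
archive layout.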
   
   Using the full path is very intuitive and can't produce conflicts (at 
least among files with the same scheme), though I agree it sometimes creates 
(maybe) unnecessary directories. Btw, someone may be able to take advantage 
of this behavior: with one central archive directory, source files are 
archived as if the archive directory were a new root directory.
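
   For comparison, the full-path behavior amounts to treating the archive 
directory as a new root. A minimal sketch in plain Python, with 
`archive_path` as a hypothetical helper working on path strings only 
(schemes and authorities are ignored here):

```python
def archive_path(archive_dir, source_file):
    """Place the full source path under the archive directory (hypothetical helper)."""
    return archive_dir.rstrip("/") + "/" + source_file.lstrip("/")

# The example from the quoted question.
print(archive_path("/a/b/c/_archive", "/a/b/c/d=1/x.txt"))

# Files matched by /a/b*/c all end up at distinct archive paths.
matched = ["/a/b/c/d.txt", "/a/b1/c/d.txt", "/a/ba/c/d.txt"]
archived = [archive_path("/archive", p) for p in matched]
print(archived)
```

   Because the mapping only prepends a fixed directory, two different source 
paths can never map to the same archive path, at the cost of repeating the 
whole source directory structure under the archive directory.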
   
   What do you think? Do you have a better idea that addresses these cases 
properly and produces simpler paths?
