HeartSaVioR commented on issue #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
URL: https://github.com/apache/spark/pull/22952#issuecomment-454645347

@zsxwing

> I feel the current behavior of rename is a bit weird. Let's say the source dir is `/a/b/c` and sourceArchiveDir is `/a/b/c/_archive`, if the file to rename is `/a/b/c/d=1/x.txt`, the final path will be `/a/b/c/_archive/a/b/c/d=1/x.txt`. Is there a special reason?

I considered a couple of cases, including a simpler approach, but things get complicated once we deal with wildcards, especially wildcards in the middle of the path.

Suppose the source path is given as `/a/b*/c`. The files that end up being consumed could then be `/a/b/c/d.txt`, `/a/b1/c/d.txt`, `/a/ba/c/d.txt`, etc. We need to make sure they don't conflict with each other when archived, so the final archive paths for these files must not be the same. Hence we can't pick `/c/d.txt` or `/d.txt` from these files.

The part of the path we can safely drop is the longest leading sub-path that contains no glob pattern. In the case above this is `/a`, so the shortest paths we could pick for these files would be `/b/c/d.txt`, `/b1/c/d.txt`, and `/ba/c/d.txt`. But I feel that's not intuitive, and end users would have to work out for themselves how the path gets modified.

Keeping the full path is very intuitive and doesn't suffer from conflicts (at least among files with the same scheme), though I agree it sometimes creates (maybe) unnecessary directories. Btw, someone may be able to leverage this behavior: with one central archive directory, source files get archived as if the archive directory were a new root directory.

What do you think? Do you have a better idea that addresses these cases properly and produces a simpler path?
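To make the trade-off concrete, here is a minimal Python sketch of the two naming schemes discussed above. It is not the code in the PR: the function names (`archive_path_full`, `glob_free_prefix`, `archive_path_stripped`) and the set of characters treated as glob syntax are assumptions made purely for illustration, and paths are handled as plain POSIX strings without a scheme.

```python
import posixpath

# Assumption: characters treated as glob syntax for this illustration only.
GLOB_CHARS = set("*?[]{}")


def archive_path_full(archive_dir: str, file_path: str) -> str:
    """Scheme 1: re-root the file's full path under the archive directory."""
    return posixpath.join(archive_dir, file_path.lstrip("/"))


def glob_free_prefix(source_pattern: str) -> str:
    """Longest leading run of path components that contain no glob character."""
    prefix = []
    for part in source_pattern.strip("/").split("/"):
        if any(ch in GLOB_CHARS for ch in part):
            break
        prefix.append(part)
    return "/" + "/".join(prefix)


def archive_path_stripped(archive_dir: str, source_pattern: str, file_path: str) -> str:
    """Scheme 2: drop the glob-free prefix, keep the rest under the archive directory."""
    relative = posixpath.relpath(file_path, glob_free_prefix(source_pattern))
    return posixpath.join(archive_dir, relative)


def keep_last_components(file_path: str, n: int) -> str:
    """Naive scheme: keep only the last n components (shown to illustrate collisions)."""
    return "/" + "/".join(file_path.strip("/").split("/")[-n:])
```

With the source pattern `/a/b*/c` and the files `/a/b/c/d.txt`, `/a/b1/c/d.txt`, and `/a/ba/c/d.txt`, `keep_last_components(path, 2)` maps all three to `/c/d.txt`, which is the conflict described above. `archive_path_stripped` keeps them distinct but requires the user to work out that `/a` was removed, while `archive_path_full` keeps them distinct with no pattern analysis at all, at the cost of deeper directories.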
