kbendick edited a comment on pull request #4307:
URL: https://github.com/apache/iceberg/pull/4307#issuecomment-1067032631


   > Regarding the automatic detection, by looking into the partition field 
names, I have one question: would you still ignore all other hidden paths, that 
are not related to the partitions or disable the HiddenPathFilter completely. 
For illustration purposes: Given the partition name `_part` and the following 
paths:
   > 
   > ```
   > -- /data/
   >       | --  _part=AA
   >       | --  _part=BB
   >       | --  _part=CC
   >       | --  _some-folder
   > ```
   > Would you only include the paths `_part=*` and ignore the path 
`_some-folder` or include all paths?
   
   I originally had the same comment as Ryan, about checking for partitions 
specifically as that was the source of the original problem.
   
   However, thinking on this question, I would think we would still want to 
allow for removing directories / paths that start with a leading `_` if the 
option were enabled, even if they aren't part of partitions. In particular, 
some file output committers use `_` before moving their data over via rename 
operation, which might be something that needs to be cleaned up using 
`RemoveOrphanFiles` in case the job fails at a specific time.
   
   Please correct me if I'm mistaken about the file output committers as I tend 
to personally use S3 (and then either S3FileIO or S3A) and the magic output 
committers if using s3a. 
   
   But if we're detecting partition paths automatically and not filtering on 
those, then possibly we'd want to keep this second scenario as an additional 
option.
   
   I have also not historically used HDFS much in production scenarios, so 
possibly I'm mistaken about the benefits / utility of having hidden paths or 
how they'd end up in the directory structure of the table (outside of the 
previously two mentioned situations). Please feel free to let me know if I'm 
mistaken. 🙂 
   
   > Also, where should I document the new option?
   
   For documenting the new option, let's worry about that once we get it 
confirmed into the right place. But definitely let's be sure to document it.
   
   I would start with adding it as a JavaDoc comment, assuming that there is a 
JavaDoc comment for the class already that documents the options for the action 
(I believe there is). Once we're closer to the final solution, we can also find 
a place in the docs themselves to put this option if necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to