kbendick edited a comment on pull request #4307: URL: https://github.com/apache/iceberg/pull/4307#issuecomment-1067032631
> Regarding the automatic detection, by looking into the partition field names, I have one question: would you still ignore all other hidden paths, that are not related to the partitions or disable the HiddenPathFilter completely. For illustration purposes: Given the partition name `_part` and the following paths: > > ``` > -- /data/ > | -- _part=AA > | -- _part=BB > | -- _part=CC > | -- _some-folder > ``` > Would you only include the paths `_part=*` and ignore the path `_some-folder` or include all paths? I originally had the same comment as Ryan, about checking for partitions specifically as that was the source of the original problem. However, thinking on this question, I would think we would still want to allow for removing directories / paths that start with a leading `_` if the option were enabled, even if they aren't part of partitions. In particular, some file output committers use `_` before moving their data over via rename operation, which might be something that needs to be cleaned up using `RemoveOrphanFiles` in case the job fails at a specific time. Please correct me if I'm mistaken about the file output committers as I tend to personally use S3 (and then either S3FileIO or S3A) and the magic output committers if using s3a. But if we're detecting partition paths automatically and not filtering on those, then possibly we'd want to keep this second scenario as an additional option. I have also not historically used HDFS much in production scenarios, so possibly I'm mistaken about the benefits / utility of having hidden paths or how they'd end up in the directory structure of the table (outside of the previously two mentioned situations). Please feel free to let me know if I'm mistaken. 🙂 > Also, where should I document the new option? For documenting the new option, let's worry about that once we get it confirmed into the right place. But definitely let's be sure to document it. I would start with adding it as a JavaDoc comment, assuming that there is a JavaDoc comment for the class already that documents the options for the action (I believe there is). Once we're closer to the final solution, we can also find a place in the docs themselves to put this option if necessary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org