CalvQ commented on code in PR #56374: URL: https://github.com/apache/spark/pull/56374#discussion_r3438872335
##########
docs/sql-data-sources-generic-options.md:
##########
@@ -97,6 +97,46 @@ you can use:
 </div>
 </div>
 
+### Ignored Path Segment Regex
+
+Spark allows you to use the configuration `spark.sql.files.ignoredPathSegmentRegex` or the data source option `ignoredPathSegmentRegex` to control which files are treated as
+hidden during file listing. The value is a Java regular expression that is matched (with find semantics, i.e. `java.util.regex.Matcher.find`) against each individual
+directory and file name below the path being read; names in which the regex finds a match are skipped from file listing, partition discovery, and reads, and a matching
+directory name excludes its whole subtree. The default value is `^[._]`, which skips files and directories whose names start with `_` or `.`. The data source option
+takes precedence over the configuration when both are set.
+
+Regardless of the regex, three rules always apply: names starting with `_metadata` or `_common_metadata` (Parquet summary files) are always listed, names ending in
+`._COPYING_` (in-flight copies) are always skipped, and `_`-prefixed names containing `=` (partition directories) are always kept.
+
+A regex that never matches, such as `(?!)`, disables the generic hidden-file filtering and surfaces hidden files, including Spark-internal marker files such as

Review Comment:
Explained in https://github.com/apache/spark/pull/56374#discussion_r3438860093, but the empty pattern string `""` actually matches every string under find semantics, meaning we would filter out everything. I'm currently special-casing it to match nothing.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
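The point about the empty pattern can be checked with a small sketch. It uses Python's `re.search`, which has the same find semantics as `java.util.regex.Matcher.find` for these three patterns. It is an illustration only, not Spark's code. The helper names and the sample file names are hypothetical, and the three always-applied rules (`_metadata`, `._COPYING_`, `_`-prefixed names containing `=`) are deliberately left out.

```python
import re

# Patterns taken from the doc text and the review comment.
DEFAULT_PATTERN = r"^[._]"  # documented default: names starting with "_" or "."
NEVER_MATCHES = r"(?!)"     # empty negative lookahead, fails at every position
EMPTY_PATTERN = ""          # finds an empty match at position 0 of any string

# Hypothetical path segments, for illustration only.
NAMES = ["part-00000.parquet", "_SUCCESS", ".hidden", "data"]


def finds_match(pattern: str, name: str) -> bool:
    """Find semantics: True if the pattern matches anywhere in the name."""
    return re.search(pattern, name) is not None


def skipped(pattern: str) -> list[str]:
    """Names that a find-based filter would treat as hidden."""
    return [name for name in NAMES if finds_match(pattern, name)]


if __name__ == "__main__":
    for label, pattern in [
        ("default", DEFAULT_PATTERN),
        ("never matches", NEVER_MATCHES),
        ("empty", EMPTY_PATTERN),
    ]:
        print(label, skipped(pattern))
```

Under these assumptions the empty pattern marks every name as hidden, which is why it would need special handling if it is meant to behave like "match nothing".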
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
