cloud-fan commented on code in PR #56374:
URL: https://github.com/apache/spark/pull/56374#discussion_r3399756168


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/FileSourceOptions.scala:
##########
@@ -53,9 +53,13 @@ class FileSourceOptions(
    * executors. Only the CSV data source currently honors this.
    */
   val archiveFormatEnabled: Boolean = SQLConf.get.getConf(SQLConf.ARCHIVE_FORMAT_READER_ENABLED)
+
+  val listHiddenFiles: Boolean = parameters.get(LIST_HIDDEN_FILES).map(_.toBoolean)
+    .getOrElse(SQLConf.get.listHiddenFiles)
 }
 
 object FileSourceOptions {
   val IGNORE_CORRUPT_FILES = "ignoreCorruptFiles"
   val IGNORE_MISSING_FILES = "ignoreMissingFiles"
+  val LIST_HIDDEN_FILES = "listHiddenFiles"

Review Comment:
   I'd keep the boolean `listHiddenFiles`.
   
   Re `hiddenFileRegex`: the current filter isn't actually expressible as a 
name-prefix regex, which makes a regex option more misleading than flexible. 
Today's rule also drops `*._COPYING_` by suffix, exempts 
`_metadata`/`_common_metadata` (Parquet summary files), and special-cases 
`_x=y` names (`startsWith("_") && !contains("=")`). So `^[\.\_]` as the 
"behavior-preserving default" is already subtly wrong, and exposing the rule as 
a user-supplied regex either loses these special cases or forces users to 
understand them. The flexibility use case (e.g. surface `_` files but keep `.` 
files hidden) is already covered by combining this option with 
`pathGlobFilter`. If a real need shows up later, a regex option can still be 
added alongside the boolean without breaking anything.
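
   To make the mismatch concrete, here is a minimal sketch of the rule as described above, next to the `^[._]` prefix regex. The object and method names are hypothetical, and matching the Parquet summary files by prefix is an assumption; this is an illustration, not the code in the PR.

```scala
object HiddenFileRuleSketch {
  // Hypothetical helper: true when a file name would be skipped during listing,
  // following the rule described in the paragraph above.
  def isHidden(name: String): Boolean = {
    val excluded =
      (name.startsWith("_") && !name.contains("=")) || // `_x=y` names are kept
        name.startsWith(".") ||
        name.endsWith("._COPYING_")                     // in-progress copies, matched by suffix
    // Assumption: Parquet summary files are exempted by prefix match.
    val exempt = name.startsWith("_metadata") || name.startsWith("_common_metadata")
    excluded && !exempt
  }

  // The name-prefix regex proposed as the "behavior-preserving default".
  private val prefixRegex = "^[._]".r
  def regexHidden(name: String): Boolean = prefixRegex.findFirstIn(name).isDefined

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "part-00000.parquet", "_SUCCESS", ".part.crc",
      "_metadata", "_x=1", "data.csv._COPYING_")
    // Print both verdicts side by side so the disagreements are visible.
    names.foreach { n =>
      println(s"$n rule=${isHidden(n)} regex=${regexHidden(n)}")
    }
  }
}
```

   Under these assumptions the two predicates disagree on `_metadata`, `_x=1`, and `data.csv._COPYING_`, which are exactly the special cases a plain prefix regex cannot express.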
   
   Re `ignoreMetadataFiles`: the filtered set is broader than metadata files (a 
`.foo.json` sidecar isn't metadata), and the default would be `true`, so users 
would have to set `ignoreMetadataFiles=false` to see the files, which is a 
double negative. "Hidden files" is the established Hadoop convention 
(`FileInputFormat.hiddenFileFilter` is exactly the `_`/`.` prefix rule), so 
`listHiddenFiles` describes what the mechanism does.


