cloud-fan commented on code in PR #56374:
URL: https://github.com/apache/spark/pull/56374#discussion_r3431777676


##########
docs/sql-migration-guide.md:
##########
@@ -22,6 +22,10 @@ license: |
 * Table of contents
 {:toc}
 
+## Upgrading from Spark SQL 4.2 to 5.0

Review Comment:
   Same version issue as the config: this should target 4.3.0, not 5.0. Rather 
than a new `## Upgrading from Spark SQL 4.2 to 5.0` section, move this bullet 
into the existing `## Upgrading from Spark SQL 4.2 to 4.3` section just below, 
and change "Since Spark 5.0" to "Since Spark 4.3".



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2776,6 +2777,27 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val IGNORED_PATH_SEGMENT_REGEX = 
buildConf("spark.sql.files.ignoredPathSegmentRegex")
+    .doc("Java regular expression matched (with find semantics) against each 
directory and " +
+      "file name during file listing; matching names are skipped from listing, 
partition " +
+      "discovery, and reads. The default '^[._]' skips names starting with '_' 
or '.' (hidden " +
+      "files). Regardless of the regex, names starting with '_metadata' or 
'_common_metadata' " +
+      "are always listed, names ending in '._COPYING_' are always skipped, and 
'_'-prefixed " +
+      "names containing '=' (partition directories) are always kept. This 
configuration is " +
+      "effective only when using file-based sources such as Parquet, JSON and 
ORC. It can be " +
+      "overridden per read by the 'ignoredPathSegmentRegex' data source 
option. A regex that never " +
+      "matches (e.g. '(?!)') disables generic hidden-file filtering; note that 
directories it " +
+      "surfaces also participate in partition discovery, so a hidden directory 
next to " +
+      "partition directories causes a conflicting directory structures error 
unless " +
+      "'spark.sql.files.ignoreInvalidPartitionPaths' is enabled.")
+    .version("5.0.0")

Review Comment:
   This should be `4.3.0`, not `5.0.0`. This is a normally-backported new 
feature (not a breaking change or dependency upgrade), so it ships first in the 
next open feature release — `branch-4.x`, currently `4.3.0-SNAPSHOT` — rather 
than master's `5.0.0-SNAPSHOT`. (On current master, 11 configs use 
`.version("4.3.0")` vs 1 at `5.0.0`.)
   ```suggestion
       .version("4.3.0")
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/FileSourceOptions.scala:
##########
@@ -53,9 +53,15 @@ class FileSourceOptions(
    * executors. Only the CSV data source currently honors this.
    */
   val archiveFormatEnabled: Boolean = 
SQLConf.get.getConf(SQLConf.ARCHIVE_FORMAT_READER_ENABLED)
+
+  val ignoredPathSegmentRegex: String =
+    parameters.getOrElse(IGNORED_PATH_SEGMENT_REGEX, 
SQLConf.get.ignoredPathSegmentRegex)

Review Comment:
   Reinforcing @sandip-db's thread on `CSVDataSource.scala`: this resolved 
regex String is re-`Pattern.compile`d at ~5 call sites 
(`InMemoryFileIndex.bulkListLeafFiles`, `PartitioningAwareFileIndex`, 
`FileTable`, `DataSource`, `CSVFileFormat`) and the `require(nonEmpty)` is 
duplicated here and in `PartitioningAwareFileIndex`. Exposing a single lazy 
compiled `Pattern` here (and validating it with `Try(Pattern.compile(...))` 
like the SQL conf's `checkValue` does) would remove all of that — and as a 
bonus give the per-read option the same clear validation error the conf already 
produces, instead of the raw `PatternSyntaxException` an invalid option 
currently throws deep in listing. Non-blocking.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to