cloud-fan commented on code in PR #56374:
URL: https://github.com/apache/spark/pull/56374#discussion_r3431777676
##########
docs/sql-migration-guide.md:
##########
@@ -22,6 +22,10 @@ license: |
* Table of contents
{:toc}
+## Upgrading from Spark SQL 4.2 to 5.0
Review Comment:
Same version issue as the config: this should target 4.3.0, not 5.0. Rather
than a new `## Upgrading from Spark SQL 4.2 to 5.0` section, move this bullet
into the existing `## Upgrading from Spark SQL 4.2 to 4.3` section just below,
and change "Since Spark 5.0" to "Since Spark 4.3".
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2776,6 +2777,27 @@ object SQLConf {
.booleanConf
.createWithDefault(false)
+ val IGNORED_PATH_SEGMENT_REGEX =
buildConf("spark.sql.files.ignoredPathSegmentRegex")
+ .doc("Java regular expression matched (with find semantics) against each
directory and " +
+ "file name during file listing; matching names are skipped from listing,
partition " +
+ "discovery, and reads. The default '^[._]' skips names starting with '_'
or '.' (hidden " +
+ "files). Regardless of the regex, names starting with '_metadata' or
'_common_metadata' " +
+ "are always listed, names ending in '._COPYING_' are always skipped, and
'_'-prefixed " +
+ "names containing '=' (partition directories) are always kept. This
configuration is " +
+ "effective only when using file-based sources such as Parquet, JSON and
ORC. It can be " +
+ "overridden per read by the 'ignoredPathSegmentRegex' data source
option. A regex that never " +
+ "matches (e.g. '(?!)') disables generic hidden-file filtering; note that
directories it " +
+ "surfaces also participate in partition discovery, so a hidden directory
next to " +
+ "partition directories causes a conflicting directory structures error
unless " +
+ "'spark.sql.files.ignoreInvalidPartitionPaths' is enabled.")
+ .version("5.0.0")
Review Comment:
This should be `4.3.0`, not `5.0.0`. This is a normally-backported new
feature (not a breaking change or dependency upgrade), so it ships first in the
next open feature release — `branch-4.x`, currently `4.3.0-SNAPSHOT` — rather
than master's `5.0.0-SNAPSHOT`. (On current master, 11 configs use
`.version("4.3.0")` vs 1 at `5.0.0`.)
```suggestion
.version("4.3.0")
```
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/FileSourceOptions.scala:
##########
@@ -53,9 +53,15 @@ class FileSourceOptions(
* executors. Only the CSV data source currently honors this.
*/
val archiveFormatEnabled: Boolean =
SQLConf.get.getConf(SQLConf.ARCHIVE_FORMAT_READER_ENABLED)
+
+ val ignoredPathSegmentRegex: String =
+ parameters.getOrElse(IGNORED_PATH_SEGMENT_REGEX,
SQLConf.get.ignoredPathSegmentRegex)
Review Comment:
Reinforcing @sandip-db's thread on `CSVDataSource.scala`: this resolved
regex String is re-`Pattern.compile`d at ~5 call sites
(`InMemoryFileIndex.bulkListLeafFiles`, `PartitioningAwareFileIndex`,
`FileTable`, `DataSource`, `CSVFileFormat`) and the `require(nonEmpty)` is
duplicated here and in `PartitioningAwareFileIndex`. Exposing a single lazy
compiled `Pattern` here (and validating it with `Try(Pattern.compile(...))`
like the SQL conf's `checkValue` does) would remove all of that — and as a
bonus give the per-read option the same clear validation error the conf already
produces, instead of the raw `PatternSyntaxException` an invalid option
currently throws deep in listing. Non-blocking.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]