sandip-db commented on code in PR #56374:
URL: https://github.com/apache/spark/pull/56374#discussion_r3425795477
##########
docs/sql-data-sources-generic-options.md:
##########
@@ -97,6 +97,46 @@ you can use:
</div>
</div>
+### Ignored Path Segment Regex
+
+Spark allows you to use the configuration `spark.sql.files.ignoredPathSegmentRegex` or the data source option `ignoredPathSegmentRegex` to control which files are treated as
+hidden during file listing. The value is a Java regular expression that is matched (with find semantics, i.e. `java.util.regex.Matcher.find`) against each individual
+directory and file name below the path being read; names in which the regex finds a match are skipped from file listing, partition discovery, and reads, and a matching
+directory name excludes its whole subtree. The default value is `^[._]`, which skips files and directories whose names start with `_` or `.`. The data source option
+takes precedence over the configuration when both are set.
+
+Regardless of the regex, three rules always apply: names starting with `_metadata` or `_common_metadata` (Parquet summary files) are always listed, names ending in
+`._COPYING_` (in-flight copies) are always skipped, and `_`-prefixed names containing `=` (partition directories) are always kept.
+
+A regex that never matches, such as `(?!)`, disables the generic hidden-file filtering and surfaces hidden files, including Spark-internal marker files such as
Review Comment:
Is there an issue with using an empty pattern string `""`?
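
To make the question concrete, here is a minimal sketch of the filtering rules as the quoted documentation describes them. It uses Python's `re.search` as a stand-in for `java.util.regex.Matcher.find`. The function name `should_filter_out` and the order in which the three fixed rules are checked are assumptions made for illustration; this is not the PR's implementation, and the sketch does not model any special-casing of `""` the PR may add.

```python
import re


def should_filter_out(name: str, ignored: re.Pattern) -> bool:
    """Return True if `name` would be hidden from file listing (sketch only)."""
    # Rule 1: Parquet summary files are always listed.
    if name.startswith(("_metadata", "_common_metadata")):
        return False
    # Rule 2: in-flight copies are always skipped.
    if name.endswith("._COPYING_"):
        return True
    # Rule 3: "_"-prefixed partition directories such as "_col=1" are always kept.
    if name.startswith("_") and "=" in name:
        return False
    # Otherwise apply find semantics: a match anywhere in the name hides it.
    return ignored.search(name) is not None


default = re.compile(r"^[._]")  # documented default
never = re.compile(r"(?!)")     # empty negative lookahead: never matches
empty = re.compile("")          # empty pattern: zero-length match in every name

names = ["part-0.parquet", "_SUCCESS", ".hidden", "_col=1", "_metadata", "x._COPYING_"]
hidden_default = [n for n in names if should_filter_out(n, default)]
hidden_never = [n for n in names if should_filter_out(n, never)]
hidden_empty = [n for n in names if should_filter_out(n, empty)]
```

Under plain find semantics, the empty pattern finds a zero-length match in every name, so it hides ordinary data files such as `part-0.parquet` instead of surfacing hidden ones. `""` could therefore only mean "disable filtering" if the option parsing treats it as a special case.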
##########
examples/src/main/python/sql/datasource.py:
##########
@@ -67,6 +67,14 @@ def generic_file_source_options_example(spark: SparkSession) -> None:
# |file2.parquet|
# +-------------+
# $example off:recursive_file_lookup$
+
+ # $example on:ignored_path_segment_regex$
+ # "(?!)" surfaces files that are hidden by default (e.g. names starting with "_" or ".")
+ surfaced_df = spark.read.format("parquet")\
+ .option("ignoredPathSegmentRegex", "(?!)")\
Review Comment:
Wouldn't an empty string value be simpler?
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala:
##########
@@ -154,8 +154,9 @@ object InMemoryFileIndex extends Logging {
parameters: Map[String, String] = Map.empty): Seq[(Path, Seq[FileStatus])] = {
val fileSystemList =
sparkSession.sessionState.conf.useListFilesFileSystemList.split(",").map(_.trim)
- val ignoreMissingFiles =
- new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles
+ val fileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(parameters))
+ val ignoreMissingFiles = fileSourceOptions.ignoreMissingFiles
+ val listHiddenFiles = fileSourceOptions.listHiddenFiles
Review Comment:
We don't need a SQL conf if `FileSourceOptions` is used. And
`CatalogFileIndex` will ignore `FileSourceOptions`. So we should be ok with
enabling caching.
##########
docs/sql-migration-guide.md:
##########
@@ -22,6 +22,10 @@ license: |
* Table of contents
{:toc}
+## Upgrading from Spark SQL 4.2 to 5.0
+
+- Since Spark 5.0, zero-length files are skipped during Parquet schema inference instead of failing with a `FAILED_READ_FILE.CANNOT_READ_FILE_FOOTER` error.
Review Comment:
This is unrelated to this PR
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala:
##########
@@ -100,13 +101,16 @@ abstract class CSVDataSource extends Serializable {
*
* @param getParser builds a fresh [[UnivocityParser]].
* @param getHeaderChecker builds a fresh [[CSVHeaderChecker]] for `(isStartOfFile, source)`.
+ * @param ignoredPathSegmentRegex the compiled effective `ignoredPathSegmentRegex` option, so hidden
+ * entries are skipped exactly like Spark's file listing would.
*/
def readArchive(
conf: Configuration,
file: PartitionedFile,
getParser: () => UnivocityParser,
getHeaderChecker: (Boolean, String) => CSVHeaderChecker,
- requiredSchema: StructType): Iterator[InternalRow]
+ requiredSchema: StructType,
+ ignoredPathSegmentRegex: Pattern): Iterator[InternalRow]
Review Comment:
`ignoredPathSegmentRegex` is already available in `FileSourceOptions`. Why
does it need to be threaded through again via these arguments? Is there any
issue with having a lazy val in `FileSourceOptions` that computes the `Pattern`?
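
As a rough illustration of that suggestion, the sketch below keeps the compiled pattern on the options object and builds it on first access, using Python's `functools.cached_property` as an analogue of a Scala `lazy val`. The class `FileSourceOptionsSketch` and its attribute names are hypothetical; it does not mirror Spark's actual `FileSourceOptions`, and the fallback to the SQL configuration is reduced to a hard-coded default.

```python
import re
from functools import cached_property


class FileSourceOptionsSketch:
    """Hypothetical stand-in for FileSourceOptions; not Spark's actual class."""

    DEFAULT_REGEX = r"^[._]"  # default value quoted in the docs under review

    def __init__(self, parameters: dict[str, str]):
        # Lower-case the keys, loosely mirroring a case-insensitive option map.
        self._params = {k.lower(): v for k, v in parameters.items()}

    @cached_property
    def ignored_path_segment_pattern(self) -> re.Pattern:
        # Compiled on first access and cached afterwards, like a Scala lazy val.
        raw = self._params.get("ignoredpathsegmentregex", self.DEFAULT_REGEX)
        return re.compile(raw)


opts = FileSourceOptionsSketch({"ignoredPathSegmentRegex": "^tmp_"})
pattern = opts.ignored_path_segment_pattern
```

With this shape, callers such as `readArchive` could read the pattern from the options object they already receive, and no extra `Pattern` parameter would be needed.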
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala:
##########
@@ -182,8 +194,11 @@ object InMemoryFileIndex extends Logging {
}
-private class PathFilterWrapper(val filter: PathFilter) extends PathFilter with Serializable {
+private class PathFilterWrapper(
+ val filter: PathFilter,
+ val ignoredPathSegmentRegex: Pattern) extends PathFilter with Serializable {
override def accept(path: Path): Boolean = {
- (filter == null || filter.accept(path)) && !HadoopFSUtils.shouldFilterOutPathName(path.getName)
+ (filter == null || filter.accept(path)) &&
+ !HadoopFSUtils.shouldFilterOutPathName(path.getName, ignoredPathSegmentRegex)
Review Comment:
Wouldn't `listFiles` above already have evaluated the
`ignoredPathSegmentRegex` against the path name?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]