sandip-db commented on code in PR #56374:
URL: https://github.com/apache/spark/pull/56374#discussion_r3425795477


##########
docs/sql-data-sources-generic-options.md:
##########
@@ -97,6 +97,46 @@ you can use:
 </div>
 </div>
 
+### Ignored Path Segment Regex
+
+Spark allows you to use the configuration `spark.sql.files.ignoredPathSegmentRegex` or the data source option `ignoredPathSegmentRegex` to control which files are treated as
+hidden during file listing. The value is a Java regular expression that is matched (with find semantics, i.e. `java.util.regex.Matcher.find`) against each individual
+directory and file name below the path being read; names in which the regex finds a match are skipped from file listing, partition discovery, and reads, and a matching
+directory name excludes its whole subtree. The default value is `^[._]`, which skips files and directories whose names start with `_` or `.`. The data source option
+takes precedence over the configuration when both are set.
+
+Regardless of the regex, three rules always apply: names starting with `_metadata` or `_common_metadata` (Parquet summary files) are always listed, names ending in
+`._COPYING_` (in-flight copies) are always skipped, and `_`-prefixed names containing `=` (partition directories) are always kept.
+
+A regex that never matches, such as `(?!)`, disables the generic hidden-file filtering and surfaces hidden files, including Spark-internal marker files such as

Review Comment:
   Is there any issue with an empty pattern string (`""`)?
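
To make the question concrete, here is a minimal sketch of the filtering rules as the quoted docs describe them. It is not Spark's implementation: `should_filter_out` is a hypothetical helper, the order in which the three fixed rules are checked is an assumption, and Python's `re.search` stands in for `java.util.regex.Matcher.find` (the two agree for the patterns used here).

```python
import re

DEFAULT_IGNORED_REGEX = r"^[._]"  # default value quoted in the docs above


def should_filter_out(name: str, ignored_regex: str = DEFAULT_IGNORED_REGEX) -> bool:
    """Hypothetical helper: return True if `name` would be hidden from listing."""
    # Fixed rules from the docs; the order they are checked in is an assumption.
    if name.startswith("_metadata") or name.startswith("_common_metadata"):
        return False  # Parquet summary files are always listed
    if name.endswith("._COPYING_"):
        return True  # in-flight copies are always skipped
    if name.startswith("_") and "=" in name:
        return False  # partition directories are always kept
    # Find semantics: a match anywhere in the name hides it.
    return re.search(ignored_regex, name) is not None


names = ["part-0.parquet", "_SUCCESS", ".hidden", "_metadata", "_col=1", "x._COPYING_"]
hidden_by_default = [n for n in names if should_filter_out(n)]
hidden_never_matching = [n for n in names if should_filter_out(n, "(?!)")]
hidden_empty_pattern = [n for n in names if should_filter_out(n, "")]
```

Under find semantics an empty pattern finds a (zero-length) match in every name, so in this sketch `""` hides `part-0.parquet` as well, whereas `(?!)` hides only the `._COPYING_` entry. Whether the PR special-cases the empty string is not visible in the quoted hunk.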



##########
examples/src/main/python/sql/datasource.py:
##########
@@ -67,6 +67,14 @@ def generic_file_source_options_example(spark: SparkSession) -> None:
     # |file2.parquet|
     # +-------------+
     # $example off:recursive_file_lookup$
+
+    # $example on:ignored_path_segment_regex$
+    # "(?!)" surfaces files that are hidden by default (e.g. names starting with "_" or ".")
+    surfaced_df = spark.read.format("parquet")\
+        .option("ignoredPathSegmentRegex", "(?!)")\

Review Comment:
   Wouldn't an empty string value be simpler?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala:
##########
@@ -154,8 +154,9 @@ object InMemoryFileIndex extends Logging {
      parameters: Map[String, String] = Map.empty): Seq[(Path, Seq[FileStatus])] = {
     val fileSystemList =
       sparkSession.sessionState.conf.useListFilesFileSystemList.split(",").map(_.trim)
-    val ignoreMissingFiles =
-      new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles
+    val fileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(parameters))
+    val ignoreMissingFiles = fileSourceOptions.ignoreMissingFiles
+    val listHiddenFiles = fileSourceOptions.listHiddenFiles

Review Comment:
   We don't need a SQL conf if `FileSourceOptions` is used. Also, 
`CatalogFileIndex` will ignore `FileSourceOptions`, so we should be OK with 
enabling caching.



##########
docs/sql-migration-guide.md:
##########
@@ -22,6 +22,10 @@ license: |
 * Table of contents
 {:toc}
 
+## Upgrading from Spark SQL 4.2 to 5.0
+
+- Since Spark 5.0, zero-length files are skipped during Parquet schema inference instead of failing with a `FAILED_READ_FILE.CANNOT_READ_FILE_FOOTER` error.

Review Comment:
   This change is unrelated to this PR.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala:
##########
@@ -100,13 +101,16 @@ abstract class CSVDataSource extends Serializable {
    *
    * @param getParser builds a fresh [[UnivocityParser]].
    * @param getHeaderChecker builds a fresh [[CSVHeaderChecker]] for `(isStartOfFile, source)`.
+   * @param ignoredPathSegmentRegex the compiled effective `ignoredPathSegmentRegex` option, so hidden
+   *                           entries are skipped exactly like Spark's file listing would.
    */
   def readArchive(
       conf: Configuration,
       file: PartitionedFile,
       getParser: () => UnivocityParser,
       getHeaderChecker: (Boolean, String) => CSVHeaderChecker,
-      requiredSchema: StructType): Iterator[InternalRow]
+      requiredSchema: StructType,
+      ignoredPathSegmentRegex: Pattern): Iterator[InternalRow]

Review Comment:
   `ignoredPathSegmentRegex` is already available in `FileSourceOptions`. Why 
does it need to be threaded again through these arguments? Is there any issue 
with having a `lazy val` in `FileSourceOptions` compute the `Pattern`?
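
As a rough illustration of the suggestion, the sketch below compiles the pattern lazily inside an options holder, so callers read it from the options object instead of receiving it as an extra argument. Everything here is hypothetical: `FileSourceOptionsSketch`, its constructor parameters, and the property name are made up for illustration, and `functools.cached_property` is used as a Python analogue of a Scala `lazy val`.

```python
import re
from functools import cached_property
from typing import Mapping


class FileSourceOptionsSketch:
    """Hypothetical stand-in for an options holder; not Spark's FileSourceOptions."""

    def __init__(self, parameters: Mapping[str, str], conf_default: str = r"^[._]"):
        # Keys are lower-cased to mimic case-insensitive option lookup.
        self._params = {k.lower(): v for k, v in parameters.items()}
        # Stands in for the session configuration value (assumed default from the docs).
        self._conf_default = conf_default

    @cached_property
    def ignored_path_segment_pattern(self) -> re.Pattern:
        # Compiled on first access only, then cached on the instance.
        # The data source option wins over the configuration default.
        raw = self._params.get("ignoredpathsegmentregex", self._conf_default)
        return re.compile(raw)


opts = FileSourceOptionsSketch({"ignoredPathSegmentRegex": "(?!)"})
pattern = opts.ignored_path_segment_pattern  # compiled here, reused afterwards
```

With this shape, a reader such as `readArchive` could take the options object it already has access to and ask it for the pattern, which is the simplification the comment is asking about.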



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala:
##########
@@ -182,8 +194,11 @@ object InMemoryFileIndex extends Logging {
 
 }
 
-private class PathFilterWrapper(val filter: PathFilter) extends PathFilter with Serializable {
+private class PathFilterWrapper(
+    val filter: PathFilter,
+    val ignoredPathSegmentRegex: Pattern) extends PathFilter with Serializable {
   override def accept(path: Path): Boolean = {
-    (filter == null || filter.accept(path)) && !HadoopFSUtils.shouldFilterOutPathName(path.getName)
+    (filter == null || filter.accept(path)) &&
+      !HadoopFSUtils.shouldFilterOutPathName(path.getName, ignoredPathSegmentRegex)

Review Comment:
   Wouldn't `listFiles` above already have evaluated the 
`ignoredPathSegmentRegex` against the path name?


