cloud-fan commented on code in PR #56374:
URL: https://github.com/apache/spark/pull/56374#discussion_r3399757994


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala:
##########
@@ -154,8 +154,9 @@ object InMemoryFileIndex extends Logging {
       parameters: Map[String, String] = Map.empty): Seq[(Path, 
Seq[FileStatus])] = {
     val fileSystemList =
       
sparkSession.sessionState.conf.useListFilesFileSystemList.split(",").map(_.trim)
-    val ignoreMissingFiles =
-      new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles
+    val fileSourceOptions = new 
FileSourceOptions(CaseInsensitiveMap(parameters))
+    val ignoreMissingFiles = fileSourceOptions.ignoreMissingFiles
+    val listHiddenFiles = fileSourceOptions.listHiddenFiles

Review Comment:
   This interacts badly with `FileStatusCache`: cache entries are keyed by path 
only, but the listing they hold now depends on `listHiddenFiles`. Path-based 
reads are safe (each read gets a fresh cache client), but `CatalogFileIndex` 
keeps one cache client for the table's lifetime and resolves the flag from the 
session conf at listing time — so toggling `spark.sql.files.listHiddenFiles` 
between queries on a catalog table silently returns the listing cached under 
the old value (in both directions) until `REFRESH TABLE` or eviction.
   
   I'd bypass the cache (skip both `getLeafFiles` and `putLeafFiles` in 
`listLeafFiles`) when `listHiddenFiles` is true, so cached entries always hold 
the canonical hidden-filtered listing. That's simpler than widening the cache 
key, and re-listing on an opt-in flag is an acceptable cost.



##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -342,8 +356,12 @@ private[spark] object HadoopFSUtils extends Logging {
   }
   // scalastyle:on argcount
 
-  /** Checks if we should filter out this path name. */
-  def shouldFilterOutPathName(pathName: String): Boolean = {
+  /**
+   * Checks if we should filter out this path name. When `listHiddenFiles` is 
true, nothing is
+   * considered hidden and this always returns false.
+   */
+  def shouldFilterOutPathName(pathName: String, listHiddenFiles: Boolean = 
false): Boolean = {

Review Comment:
   Two production callers still rely on the default `false`:
   
   - `ArchiveReader.shouldSkipEntry` (ArchiveReader.scala:159) — its Scaladoc 
promises parity between archive-entry filtering and loose-file listing, but 
with the option on, hidden entries inside a tar are still skipped while the 
same files read loose are surfaced.
   - `DataSource.checkAndGlobPathIfNecessary` (DataSource.scala:847) — logs 
"All paths were ignored" for explicitly-passed hidden paths even when the 
option surfaces them.
   
   Worth threading the flag to both, or at least scoping the ArchiveReader 
Scaladoc claim to default listing.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala:
##########
@@ -143,7 +143,10 @@ object ParquetUtils extends Logging {
     val leaves = allFiles.toArray.sortBy(_.getPath.toString)
 
     FileTypes(
-      data = leaves.filterNot(f => 
isSummaryFile(f.getPath)).toImmutableArraySeq,
+      // Zero-length files (e.g. `_SUCCESS` markers surfaced by the 
`listHiddenFiles` option)
+      // cannot contain a Parquet footer, so exclude them from schema 
inference candidates.
+      data = leaves.filter(_.getLen > 0).filterNot(f => 
isSummaryFile(f.getPath))

Review Comment:
   Since this changes behavior on the default path too (a 0-byte parquet file 
used to fail schema inference with a footer error and is now silently skipped), 
it deserves an entry in `docs/sql-migration-guide.md`.



##########
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala:
##########
@@ -1394,6 +1394,89 @@ class FileBasedDataSourceSuite extends SharedSparkSession
       }
     }
   }
+
+  // Asserts the listHiddenFiles contract for `format`: default hides the 
'_'-prefixed file; the
+  // per-read option and the session conf each surface it; the option 
overrides the conf.
+  private def checkListHiddenFilesContract(format: String, basePath: String): 
Unit = {
+    // Default: the hidden file is filtered out.
+    checkAnswer(spark.read.format(format).load(basePath), Seq(Row("visible")))
+
+    // Per-read option surfaces the hidden data.
+    checkAnswer(
+      spark.read.option("listHiddenFiles", 
"true").format(format).load(basePath),
+      Seq(Row("visible"), Row("hidden")))
+
+    // Session conf surfaces the hidden data too.
+    withSQLConf(SQLConf.LIST_HIDDEN_FILES.key -> "true") {
+      checkAnswer(
+        spark.read.format(format).load(basePath),
+        Seq(Row("visible"), Row("hidden")))
+    }
+
+    // Precedence: the per-read option overrides the session conf in both 
directions.
+    withSQLConf(SQLConf.LIST_HIDDEN_FILES.key -> "false") {
+      checkAnswer(
+        spark.read.option("listHiddenFiles", 
"true").format(format).load(basePath),
+        Seq(Row("visible"), Row("hidden")))
+    }
+    withSQLConf(SQLConf.LIST_HIDDEN_FILES.key -> "true") {
+      checkAnswer(
+        spark.read.option("listHiddenFiles", 
"false").format(format).load(basePath),
+        Seq(Row("visible")))
+    }
+  }
+
+  test("Option listHiddenFiles: surfaces hidden files only when enabled - 
json") {
+    withTempDir { dir =>
+      val basePath = dir.getCanonicalPath
+      // Write JSON files directly: a visible record and a '_'-prefixed hidden 
record.
+      Files.write(new File(dir, "visible.json").toPath, 
"""{"a":"visible"}""".getBytes)
+      Files.write(new File(dir, "_hidden.json").toPath, 
"""{"a":"hidden"}""".getBytes)
+
+      checkListHiddenFilesContract("json", basePath)
+    }
+  }
+
+  test("Option listHiddenFiles: surfaces hidden files only when enabled - 
parquet") {
+    withTempDir { dir =>
+      // Copy only the part-*.parquet from a scratch write into basePath -- 
this drops Spark's own
+      // _SUCCESS / _temporary marker files, so listHiddenFiles surfaces only 
the intended hidden
+      // file. Flat layout (files directly under basePath) mirrors the JSON 
test.
+      val basePath = dir.getCanonicalPath
+
+      def copyPartFile(value: String, destName: String): Unit = {
+        withTempDir { scratch =>
+          // withTempDir pre-creates the directory, so overwrite to avoid 
PATH_ALREADY_EXISTS.
+          Seq(value).toDF("a").write.format("parquet").mode("overwrite")
+            .save(scratch.getCanonicalPath)
+          val partFile = scratch.listFiles()
+            .find(f => f.isFile && f.getName.startsWith("part-") && 
f.getName.endsWith(".parquet"))
+            .getOrElse(fail(s"no part file produced under 
${scratch.getCanonicalPath}"))
+          Files.copy(partFile.toPath, new File(dir, destName).toPath)
+        }
+      }
+      copyPartFile("visible", "visible.parquet")
+      copyPartFile("hidden", "_hidden.parquet")
+
+      checkListHiddenFilesContract("parquet", basePath)
+    }
+  }
+
+  test("Option listHiddenFiles: reading an unmodified Spark-written parquet 
directory works") {

Review Comment:
   Consider also covering the partitioned layout (`_SUCCESS` at the table root 
next to `part=` dirs) — the most common Spark-written shape. I traced that it 
works (`parsePartition` stops at the base path, so the root `_SUCCESS` doesn't 
create a conflicting base path), but that's exactly the behavior this feature 
most needs to lock in with a test.



##########
docs/sql-data-sources-generic-options.md:
##########
@@ -97,6 +97,16 @@ you can use:
 </div>
 </div>
 
+### List Hidden Files

Review Comment:
   The sibling sections (Ignore Corrupt Files, Path Glob Filter, Recursive File 
Lookup) all carry runnable code-tab examples; consider adding one here for 
consistency.



##########
docs/sql-data-sources-generic-options.md:
##########
@@ -97,6 +97,16 @@ you can use:
 </div>
 </div>
 
+### List Hidden Files
+
+Spark allows you to use the configuration `spark.sql.files.listHiddenFiles` or 
the data source option `listHiddenFiles` to include hidden files while reading 
data
+from files. Its default value is `false`, which means files and directories 
whose names start with `_` or `.` are skipped during
+file listing. When set to true, such hidden files and directories participate 
in file listing, partition discovery, and reads.
+The data source option takes precedence over the configuration when both are 
set.
+
+Note that enabling this option also surfaces Spark-internal marker files, such 
as `_SUCCESS`, `._COPYING_`, and files under

Review Comment:
   `._COPYING_` files come from Hadoop FS shell copies (`hadoop fs 
-put`/`-cp`), not Spark:
   ```suggestion
   Note that enabling this option also surfaces internal marker and temporary 
files, such as `_SUCCESS`, Hadoop's `._COPYING_` in-progress copy files, and 
files under
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to