HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r586286229



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: 
SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       Let's focus on "glob path" here - from the quick look on 
org.apache.hadoop.fs.FileSystem javadoc, both `isDirectory()` and `exists()` 
seem to require the exact path, not glob path. The method which accept glob 
path is `globStatus()`, and there the API clearly names the parameter as 
`pathPattern`.
   
   
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html
   
   And intuitively, it sounds me as very odd if someone can say the glob path 
is a directory while matching paths can be both files and directories, and same 
for checking existence as well. I don't think it's feasible to "expect" the 
meaningful value from calling these methods with glob path. That sounds to me 
as "undefined behavior" even any weird file system could return true for the 
case.
   
   That said, I'd rather consider the input of glob path as "wrong one" and 
always return false (some sort of debug log message is fine if we really like 
to log). If there's a code relying on such behavior, I don't think that is 
correct. I'd rather say the possible paths should be populated before, and this 
method should be called per each path.
   
   I still need to hear voices from others, but if the consensus goes to just 
disallow the glob path here, we won't need to introduce the new configuration.
   
   @tdas @zsxwing @jose-torres @viirya @gaborgsomogyi Would like to hear your 
voices on this. Thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to