[GitHub] [spark] zsxwing commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

GitBox Thu, 25 Mar 2021 16:45:09 -0700


zsxwing commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-807747189



   I think the issue we are trying to fix is this: when a glob path is valid 
but we cannot call `getFileStatus` with it, how do we still allow users to 
access the batch output?
   
   For example, a glob path can be very long, such as 
`/foo/bar/{20200101,20200102,...20201231}`. `getFileStatus` cannot accept it 
because most cloud storage systems reject such a long path. However, after we 
expand the glob path into real file paths, each of those paths is valid.
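
   To make the length argument concrete, here is a minimal sketch in plain 
Scala. `expandBraces` is a hypothetical helper that handles only a single, 
non-nested `{a,b,c}` group; it is not Hadoop's glob implementation, and the 
date layout is just the one from the example above.

```scala
// Simplified illustration only: expands one non-nested {a,b,c} group.
// This is NOT how Hadoop expands globs; it only shows the size difference
// between the pattern and the concrete paths it stands for.
object BraceExpansionSketch {
  def expandBraces(pattern: String): Seq[String] = {
    val open = pattern.indexOf('{')
    val close = pattern.indexOf('}', open + 1)
    if (open < 0 || close < 0) {
      Seq(pattern) // no brace group: the pattern is already a concrete path
    } else {
      val prefix = pattern.substring(0, open)
      val suffix = pattern.substring(close + 1)
      pattern.substring(open + 1, close).split(',').toSeq
        .map(alternative => prefix + alternative + suffix)
    }
  }

  def main(args: Array[String]): Unit = {
    // One alternative per day of January 2020 (hypothetical directory layout).
    val days = (1 to 31).map(d => f"202001$d%02d")
    val pattern = days.mkString("/foo/bar/{", ",", "}")
    val expanded = expandBraces(pattern)
    println(s"pattern length: ${pattern.length}")
    println(s"expanded paths: ${expanded.size}, longest: ${expanded.map(_.length).max}")
  }
}
```

   The pattern grows with every alternative, while each expanded path stays 
short, which is why the expanded paths can be valid even when the pattern 
itself is rejected.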
   
   In 2.4, we blindly ignored errors when checking whether a directory has 
`_metadata_log` or not, so the above case works in 2.4. In 3.0, however, we 
no longer ignore these errors, which exposes this issue.
   
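   In addition, a path that is classified as a glob path can also be a normal 
(literal) path, which makes this much more complicated. For example, a 
streaming query can write to a directory literally named `[]`: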
   
   ```
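     // Imports the snippet relies on (it is meant to run inside a Spark SQL test suite).
     import org.apache.hadoop.fs.Path
     import org.apache.spark.deploy.SparkHadoopUtil
     import org.apache.spark.sql.execution.FileSourceScanExec
     import org.apache.spark.sql.execution.streaming.MetadataLogFileIndex

     test("foo") {
       withTempDir { tempDir =>
         // "[]" is a legal directory name that consists only of glob characters.
         val path = new java.io.File(tempDir, "[]")
         val chk = new java.io.File(tempDir, "chk")
         val q = spark.readStream.format("rate").load().writeStream.format("parquet")
           .trigger(org.apache.spark.sql.streaming.Trigger.Once())
           .option("checkpointLocation", chk.getCanonicalPath)
           .start(path.getCanonicalPath)
         q.awaitTermination()
         // The output path is classified as a glob path...
         assert(SparkHadoopUtil.get.isGlobPath(new Path(path.getCanonicalPath)))
         val fileIndex = spark.read.format("parquet").load(path.getCanonicalPath)
           .queryExecution.executedPlan.collectFirst {
             case f: FileSourceScanExec => f.relation.location
           }.head
         // ...yet the batch read still goes through the streaming sink's metadata log.
         assert(fileIndex.isInstanceOf[MetadataLogFileIndex])
       }
     }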
   ```
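
   The reason `[]` counts as a glob path can be sketched without Spark. The 
snippet below assumes the check is purely character-based; `looksLikeGlob` and 
the character set are illustrative assumptions, not a copy of 
`SparkHadoopUtil.isGlobPath`.

```scala
// Assumption: a path is treated as a glob as soon as it contains any glob
// metacharacter, without asking the file system whether it exists literally.
object GlobCheckSketch {
  private val globChars: Set[Char] = "{}[]*?\\".toSet

  def looksLikeGlob(path: String): Boolean = path.exists(globChars.contains)

  def main(args: Array[String]): Unit = {
    println(looksLikeGlob("/tmp/out/[]"))    // literal directory name made of glob characters
    println(looksLikeGlob("/tmp/out/2020*")) // an actual wildcard pattern
    println(looksLikeGlob("/tmp/out/plain")) // no glob characters at all
  }
}
```

   Under that assumption the check cannot tell a real pattern from a literal 
directory whose name happens to contain glob characters, so glob paths cannot 
simply be skipped unconditionally.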
   
   Ideally, batch queries should not be affected by the extra code for 
streaming queries, but unfortunately that is not possible right now. I'm 
inclined to go with the initial approach, which only skips glob paths when an 
error is thrown.
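
   A rough sketch of that approach, in plain Scala rather than the real 
`FileStreamSink.hasMetadata` code. Every name here (`hasMetadata`, 
`metadataDirExists`, `looksLikeGlob`) is a hypothetical stand-in, and the 
file-system lookup is modeled as a function that may throw.

```scala
import scala.util.control.NonFatal

object HasMetadataSketch {
  // Assumed character-based glob check (illustrative, not Spark's code).
  private val globChars: Set[Char] = "{}[]*?\\".toSet
  def looksLikeGlob(path: String): Boolean = path.exists(globChars.contains)

  // `metadataDirExists` stands in for the file-system lookup and may throw.
  def hasMetadata(path: String, metadataDirExists: String => Boolean): Boolean =
    try {
      metadataDirExists(path)
    } catch {
      // Only when the lookup fails AND the path looks like a glob do we
      // assume there is no streaming metadata. Errors for non-glob paths
      // do not match this case and are rethrown.
      case NonFatal(_) if looksLikeGlob(path) => false
    }

  def main(args: Array[String]): Unit = {
    // Fake lookup: rejects long paths, and reports metadata for one directory.
    val lookup: String => Boolean = p =>
      if (p.length > 20) throw new java.io.IOException(s"path too long: $p")
      else p == "/tmp/out/[]"

    println(hasMetadata("/foo/bar/{20200101,20200102}", lookup)) // lookup throws, glob: skipped
    println(hasMetadata("/tmp/out/[]", lookup))                  // lookup succeeds: result is used
  }
}
```

   With this shape, a literal directory such as `[]` is still checked normally 
because its lookup succeeds, while a pattern the storage layer rejects falls 
back to the batch read.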

