zsxwing commented on a change in pull request #23733: [SPARK-26824][SS]Fix the
checkpoint location and _spark_metadata when it contains special chars
URL: https://github.com/apache/spark/pull/23733#discussion_r256112441
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -37,23 +39,47 @@ object FileStreamSink extends Logging {
* Returns true if there is a single path that has a metadata log indicating
which files should
* be read.
*/
- def hasMetadata(path: Seq[String], hadoopConf: Configuration): Boolean = {
+ def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf:
SQLConf): Boolean = {
path match {
case Seq(singlePath) =>
+ val hdfsPath = new Path(singlePath)
+ val fs = hdfsPath.getFileSystem(hadoopConf)
+ if (fs.isDirectory(hdfsPath)) {
+ val metadataPath = new Path(hdfsPath, metadataDir)
+ checkEscapedMetadataPath(fs, metadataPath, sqlConf)
+ fs.exists(metadataPath)
+ } else {
+ false
+ }
+ case _ => false
+ }
+ }
+
+ def checkEscapedMetadataPath(fs: FileSystem, metadataPath: Path, sqlConf:
SQLConf): Unit = {
+ if
(sqlConf.getConf(SQLConf.STREAMING_CHECKPOINT_ESCAPED_PATH_CHECK_ENABLED)
+ && StreamExecution.containsSpecialCharsInPath(metadataPath)) {
+ val legacyMetadataPath = new Path(metadataPath.toUri.toString)
+ val legacyMetadataPathExists =
try {
- val hdfsPath = new Path(singlePath)
- val fs = hdfsPath.getFileSystem(hadoopConf)
- if (fs.isDirectory(hdfsPath)) {
- fs.exists(new Path(hdfsPath, metadataDir))
- } else {
- false
- }
+ fs.exists(legacyMetadataPath)
} catch {
case NonFatal(e) =>
- logWarning(s"Error while looking for metadata directory.")
+ // We may not have access to this directory. Don't fail the query
if that happens.
+ logWarning(e.getMessage, e)
Review comment:
That's correct.
The reason I chose to ignore the error here is we may not have the
permission to check the directory. For example, if I have some special chars in
my user name such as `foo bar`. Then I try to write into my home directory
`/user/foo bar/a/b/c` in **Spark 3.0.0**, it's likely I don't have access to
`/user/foo%20bar/a/b/c` since that's in a different user home directory. Since
checking the directory should be best effort and should not impact any users
that don't hit this path issue, I prefer to ignore the error for safety.
This is different than `try-catch` in `hasMetadata` which is checking a sub
directory in the current directory. In this case, the directory is usually
accessible and the error is probably a real issue.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]