cornel creanga created SPARK-34298:
--------------------------------------
Summary: SaveMode.Overwrite not usable when using s3a root paths
Key: SPARK-34298
URL: https://issues.apache.org/jira/browse/SPARK-34298
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.1.2
Reporter: cornel creanga
SaveMode.Overwrite does not work when the output path is just the root, e.g.
"s3a://peakhour-report". To reproduce the issue (an S3 bucket and credentials
are needed):
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// spark is an existing SparkSession; accessKey and secretKey hold the S3 credentials
val out = "s3a://peakhour-report"
val sparkContext: SparkContext = SparkContext.getOrCreate()
val someData = Seq(Row(24, "mouse"))
val someSchema = List(
  StructField("age", IntegerType, true),
  StructField("word", StringType, true))
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData), StructType(someSchema))
sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
sparkContext.hadoopConfiguration.set("fs.s3a.impl",
  "org.apache.hadoop.fs.s3a.S3AFileSystem")
someDF.write.format("parquet").partitionBy("age").mode(SaveMode.Overwrite)
  .save(out)
Error stack trace:
Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
    [....]
    at org.apache.hadoop.fs.Path.suffix(Path.java:446)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:240)
If you change out from val out = "s3a://peakhour-report" to
val out = "s3a://peakhour-report/folder", the code works.
There are two problems in the current code of
InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions:
a) it uses the org.apache.hadoop.fs.Path.suffix method, which doesn't work on
root paths
b) it tries to delete the root folder directly (in our case the S3 bucket
itself), and this is prohibited (in the S3AFileSystem class)
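Problem a) can be checked without S3 access, since Path.suffix only manipulates
the URI. The following is a minimal sketch, assuming hadoop-common is on the
classpath; the names RootSuffixCheck and suffixError are made up for
illustration. It uses an empty suffix, which is an assumption about what
deleteMatchingPartitions passes when there are no static partitions, based on
the stack trace above. The root path is expected to fail and the path with a
folder is not:

```scala
import org.apache.hadoop.fs.Path
import scala.util.Try

object RootSuffixCheck {
  // Hypothetical helper: returns the exception message if Path.suffix fails
  // for the given path string, or None if it succeeds.
  def suffixError(pathString: String, suffix: String): Option[String] =
    Try(new Path(pathString).suffix(suffix)).failed.toOption.map(_.getMessage)

  def main(args: Array[String]): Unit = {
    // Bucket root: no parent and an empty name.
    println(suffixError("s3a://peakhour-report", ""))
    // Same bucket with a folder component.
    println(suffixError("s3a://peakhour-report/folder", ""))
  }
}
```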
I think that there are two choices:
a) throw an explicit error when overwrite mode is used on a root folder
b) fix the actual issue: don't use the Path.suffix method, and change the
cleanup code in InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to
list the root folder's content and delete the entries one by one.
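To make the cleanup part of choice b) concrete, here is a rough sketch, not the
proposed patch. It relies only on the Hadoop FileSystem calls listStatus and
delete plus Path.isRoot; the object RootSafeCleanup and its two methods are
hypothetical names, and the replacement for Path.suffix is not covered:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

object RootSafeCleanup {
  // Hypothetical helper: delete every entry under dir, but keep dir itself.
  // Stops at the first entry that cannot be deleted and returns false.
  def clearContents(fs: FileSystem, dir: Path): Boolean =
    fs.listStatus(dir).forall(status => fs.delete(status.getPath, true))

  // Hypothetical helper: a root path (e.g. the bucket) is emptied entry by
  // entry; any other path is deleted recursively as a whole.
  def deleteForOverwrite(fs: FileSystem, target: Path): Boolean =
    if (target.isRoot) clearContents(fs, target)
    else fs.delete(target, true)
}
```

Choice a) would use the same target.isRoot check, but throw an explicit error
instead of calling clearContents.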
I can provide a patch for both choices, assuming that they make sense.