[
https://issues.apache.org/jira/browse/HADOOP-17112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152001#comment-17152001
]
Krzysztof Adamski commented on HADOOP-17112:
--------------------------------------------
Thanks.
The code that produces the error:
{code:python}
spark.read.csv('s3a://XXX/XXX/data/test_file1.txt').write.format('csv').save('s3a://XXX/XXX/output path/csv')
{code}
and the resulting stack trace:
{code:java}
Py4JJavaError                             Traceback (most recent call last)
in <module>
----> 1 spark.read.csv('s3a://XXX/XXX/data/test_file1.txt').write.format('csv').save('s3a://XXX/XXX/output path/csv')

/usr/local/spark-3.0.1-SNAPSHOT-bin-wbaa-yarn/python/lib/pyspark.zip/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    825             self._jwrite.save()
    826         else:
--> 827             self._jwrite.save(path)
    828
    829     @since(1.4)

/usr/local/second-app-dir/venvpy3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306
   1307         for temp_arg in temp_args:

/usr/local/spark-3.0.1-SNAPSHOT-bin-wbaa-yarn/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
    126     def deco(*a, **kw):
    127         try:
--> 128             return f(*a, **kw)
    129         except py4j.protocol.Py4JJavaError as e:
    130             converted = convert_exception(e.java_exception)

/usr/local/second-app-dir/venvpy3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326             raise Py4JJavaError(
    327                 "An error occurred while calling {0}{1}{2}.\n".
--> 328                 format(target_id, ".", name), value)
    329         else:
    330             raise Py4JError(

Py4JJavaError: An error occurred while calling o86.save.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:226)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:944)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Cannot parse URI s3a://XXX/XXX/output path/csv/part-00001-182a6744-a467-4225-a09e-e2e305a66a4f-c000-application_1592135134673_20377.csv
    at org.apache.hadoop.fs.s3a.commit.files.SinglePendingCommit.destinationPath(SinglePendingCommit.java:255)
    at org.apache.hadoop.fs.s3a.commit.files.SinglePendingCommit.validate(SinglePendingCommit.java:195)
    at org.apache.hadoop.fs.s3a.commit.files.PendingSet.validate(PendingSet.java:146)
    at org.apache.hadoop.fs.s3a.commit.files.PendingSet.load(PendingSet.java:109)
    at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitter.lambda$loadPendingsetFiles$4(AbstractS3ACommitter.java:478)
    at org.apache.hadoop.fs.s3a.commit.Tasks$Builder$1.run(Tasks.java:254)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
{code}
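The inner exception comes from org.apache.hadoop.fs.s3a.commit.files.SinglePendingCommit.destinationPath, and its wording ("Cannot parse URI ...") suggests the committer re-parses the stored destination with java.net.URI, which rejects a raw (unencoded) space, whereas org.apache.hadoop.fs.Path percent-escapes spaces and accepts the same string. That would explain why the task-side writes succeed and only the commit fails. A minimal standalone sketch of the difference (the class and bucket names are mine, hadoop-common is assumed on the classpath, and this illustrates the suspected cause rather than the committer's actual code):
{code:java}
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.fs.Path;

public class UriSpaceDemo {
  public static void main(String[] args) {
    // same shape as the failing destination: a raw space in the object key
    String dest = "s3a://bucket/output path/csv/part-00001.csv";

    // org.apache.hadoop.fs.Path percent-escapes the space internally,
    // so the write phase accepts this destination without complaint
    Path p = new Path(dest);
    System.out.println("Path accepts it: " + p);

    // java.net.URI rejects the raw space, consistent with the
    // IllegalStateException("Cannot parse URI ...") raised at commit time
    try {
      URI u = new URI(dest);
      System.out.println("URI accepts it: " + u);
    } catch (URISyntaxException e) {
      System.out.println("URI rejects it: " + e.getMessage());
    }
  }
}
{code}
Running this prints the Path form unchanged but fails the URI parse with "Illegal character in path", which lines up with the failure point in the stack trace above.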
> whitespace not allowed in paths when saving files to s3a via committer
> ----------------------------------------------------------------------
>
> Key: HADOOP-17112
> URL: https://issues.apache.org/jira/browse/HADOOP-17112
> Project: Hadoop Common
> Issue Type: Sub-task
> Affects Versions: 3.2.0
> Reporter: Krzysztof Adamski
> Priority: Major
> Attachments: image-2020-07-03-16-08-52-340.png
>
>
> When saving results through a Spark dataframe on the latest 3.0.1-SNAPSHOT compiled against hadoop-3.2, with the following settings:
> --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
> --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
> --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
> --conf spark.hadoop.fs.s3a.committer.name=partitioned
> --conf spark.hadoop.fs.s3a.committer.staging.conflict-mode=replace
> we are unable to save a file with a whitespace character in the path. It works fine without one.
> I was looking into the recent commits with regard to qualifying the path, but couldn't find anything obvious. Is this a known bug?
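> The same settings can also be applied programmatically. A minimal sketch using the Java API (the class name and app name are illustrative; the XXX placeholders match the repro above):
> {code:java}
> import org.apache.spark.sql.SparkSession;
>
> public class CommitterWhitespaceRepro {
>   public static void main(String[] args) {
>     // the same five settings passed above via --conf, set on the builder
>     SparkSession spark = SparkSession.builder()
>         .appName("s3a-committer-whitespace-repro")
>         .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>             "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
>         .config("spark.sql.parquet.output.committer.class",
>             "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
>         .config("spark.sql.sources.commitProtocolClass",
>             "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
>         .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
>         .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
>         .getOrCreate();
>
>     // a destination containing a space triggers the commit-time failure
>     spark.read().csv("s3a://XXX/XXX/data/test_file1.txt")
>         .write().format("csv").save("s3a://XXX/XXX/output path/csv");
>   }
> }
> {code}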
> !image-2020-07-03-16-08-52-340.png!