[ https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126556#comment-17126556 ]

Brandon edited comment on SPARK-31911 at 6/5/20, 8:53 AM:
----------------------------------------------------------

Interesting: it looks like the staging committer supports a configuration 
parameter, `spark.sql.sources.writeJobUUID`, which takes precedence over 
`spark.app.id` when determining the name of the pendingSet directory. Notably, 
`spark.sql.sources.writeJobUUID` is not present anywhere in the Spark codebase. 
Should Spark set this to a random UUID for each write job?

https://github.com/apache/hadoop/blob/a6df05bf5e24d04852a35b096c44e79f843f4776/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/StagingCommitter.java#L186-L208
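
If write options for file-based sources end up in the per-job Hadoop 
configuration that the committer sees (I think they do, but I haven't verified), 
something like the following might work as a per-job workaround from user code. 
Treat it as an untested sketch, not a confirmed fix:

import java.util.UUID

// Untested sketch: give each write job its own UUID so the staging committer
// keys its pendingSet directory off something unique instead of spark.app.id.
// Assumes the option is propagated into the Hadoop configuration used by the
// committer for this particular write job.
val df = spark.range(100).toDF("id")   // stand-in for the real DataFrame
df.write
  .option("spark.sql.sources.writeJobUUID", UUID.randomUUID().toString)
  .orc("s3a://bucket/a")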



> Using S3A staging committer, pending uploads are committed more than once and 
> listed incorrectly in _SUCCESS data
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31911
>                 URL: https://issues.apache.org/jira/browse/SPARK-31911
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.4
>            Reporter: Brandon
>            Priority: Major
>
> First of all, thanks for the great work on the S3 committers. I was able to set 
> up the directory staging committer in my environment following docs at 
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
>  and tested one of my Spark applications using it. The Spark version is 2.4.4 
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
> multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
> Spark in parallel.
> I think I'm seeing a bug where the staging committer will complete pending 
> uploads more than once. The main symptom, and how I discovered this, is that the 
> _SUCCESS data files under each table will contain overlapping file names that 
> belong to separate tables. From my reading of the code, that's because the 
> filenames in _SUCCESS reflect which multipart uploads were completed in the 
> commit for that particular table.
> An example:
> Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, 
> so each write produces a single partition file (see the sketch after the list below).
> When the two writes are done,
>  * /a/_SUCCESS contains two filenames: /a/part-0000 and /b/part-0000.
>  * /b/_SUCCESS contains the same two filenames.
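> For reference, the write pattern is roughly the following (simplified, with 
> placeholder bucket and paths; the real application builds its own DataFrames). 
> Run from spark-shell:
> import scala.concurrent.{Await, Future}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> // Each table has a single partition, so each write produces one part file.
> val df = spark.range(100).toDF("id").coalesce(1)
> // Kick off both writes concurrently, as the application does.
> val writes = Seq("s3a://bucket/a", "s3a://bucket/b").map { path =>
>   Future { df.write.mode("overwrite").orc(path) }
> }
> writes.foreach(Await.result(_, Duration.Inf))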
> With S3A logging set to debug, I can see that the commitJob operation for table a 
> includes completing the uploads of /a/part-0000 and /b/part-0000. Then commitJob 
> for table b repeats the same completions. I haven't had a problem 
> yet, but I wonder if having these extra requests would become an issue at 
> higher scale, where dozens of commits with hundreds of files may be happening 
> concurrently in the application.
> I believe this may be caused by the way the pendingSet files are stored in 
> the staging directory. In the Hadoop code, they are stored under a single 
> directory named by the jobID. However, for all write jobs executed by the Spark 
> application, the jobID passed to Hadoop is the same - the application ID. 
> Maybe the staging commit algorithm was built on the assumption that each 
> instance of the algorithm would use a unique random jobID.
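> To illustrate my understanding of the collision (the staging directory layout 
> shown here is an assumption on my part; the authoritative path construction is 
> in the Hadoop staging committer code):
> // Both write jobs run in the same application, so both pass the same jobID:
> //   <staging tmp>/<user>/<spark.app.id>/  <- pendingSet files for table a AND table b
> // commitJob for table a loads every pendingSet in that directory and completes
> // the uploads for both tables; commitJob for table b then does the same again,
> // which matches the duplicate completions seen in the S3A debug logs.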
> [~ste...@apache.org], [~rdblue] Having seen your names on most of this work 
> (thank you), I would be interested to know your thoughts on this. Also, it's 
> my first time opening a bug here, so let me know if there's anything else I 
> can do to help report the issue.



