[ https://issues.apache.org/jira/browse/HADOOP-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126934#comment-17126934 ]
Brandon commented on HADOOP-17066:
----------------------------------
Hm, I haven't been able to find the option to attach files here, so I'll just
paste the driver log. Note that both concurrent jobs are instantiated with the
same job ID (job_20200605072439_0000) and the same task attempt ID
(attempt_20200605072439_0000_m_000000_0), both committers read pendingsets from
the same staging-uploads directory, and each job's _SUCCESS marker ends up
listing both output files:
{noformat}
20/06/05 07:24:39 INFO ParquetFileFormat: Using user defined output committer for Parquet: org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
20/06/05 07:24:39 DEBUG AbstractS3ACommitterFactory: Committer option is directory
20/06/05 07:24:39 DEBUG AbstractS3ACommitter: Task committer attempt_20200605072439_0000_m_000000_0 instantiated for job "(anonymous)" ID job_20200605072439_0000 with destination s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc
20/06/05 07:24:39 DEBUG AbstractS3ACommitter: Setting work path to file:/opt/spark-2.4.4-without-hadoop-scala-2.12/work/driver-20200605072356-0020/s3a/app-20200605072420-0016/_temporary/0/_temporary/attempt_20200605072439_0000_m_000000_0
20/06/05 07:24:39 DEBUG AbstractS3ACommitterFactory: Committer option is directory
20/06/05 07:24:39 DEBUG AbstractS3ACommitter: Task committer attempt_20200605072439_0000_m_000000_0 instantiated for job "(anonymous)" ID job_20200605072439_0000 with destination s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet
20/06/05 07:24:39 DEBUG StagingCommitter: Task committer attempt_20200605072439_0000_m_000000_0: final output path is s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc
20/06/05 07:24:39 DEBUG StagingCommitter: Conflict resolution mode: FAIL
20/06/05 07:24:39 INFO AbstractS3ACommitterFactory: Using committer directory to output data to s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc
20/06/05 07:24:39 INFO AbstractS3ACommitterFactory: Using Commmitter StagingCommitter{AbstractS3ACommitter{role=Task committer attempt_20200605072439_0000_m_000000_0, name=directory, outputPath=s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc, workPath=file:/opt/spark-2.4.4-without-hadoop-scala-2.12/work/driver-20200605072356-0020/s3a/app-20200605072420-0016/_temporary/0/_temporary/attempt_20200605072439_0000_m_000000_0}, conflictResolution=FAIL, wrappedCommitter=FileOutputCommitter{PathOutputCommitter{context=TaskAttemptContextImpl{JobContextImpl{jobId=job_20200605072439_0000}; taskId=attempt_20200605072439_0000_m_000000_0, status=''}; org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter@68b8576f}; outputPath=hdfs://namenode/s3a-staging-tmp/www-data/app-20200605072420-0016/staging-uploads, workPath=null, algorithmVersion=1, skipCleanup=false, ignoreCleanupFailures=false}} for s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc
20/06/05 07:24:39 DEBUG AbstractS3ACommitter: Setting work path to file:/opt/spark-2.4.4-without-hadoop-scala-2.12/work/driver-20200605072356-0020/s3a/app-20200605072420-0016/_temporary/0/_temporary/attempt_20200605072439_0000_m_000000_0
20/06/05 07:24:39 DEBUG StagingCommitter: Task committer attempt_20200605072439_0000_m_000000_0, Setting up job (no job ID)
20/06/05 07:24:39 DEBUG StagingCommitter: Task committer attempt_20200605072439_0000_m_000000_0: final output path is s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet
20/06/05 07:24:39 DEBUG StagingCommitter: Conflict resolution mode: FAIL
20/06/05 07:24:39 INFO AbstractS3ACommitterFactory: Using committer directory to output data to s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet
20/06/05 07:24:39 INFO AbstractS3ACommitterFactory: Using Commmitter StagingCommitter{AbstractS3ACommitter{role=Task committer attempt_20200605072439_0000_m_000000_0, name=directory, outputPath=s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet, workPath=file:/opt/spark-2.4.4-without-hadoop-scala-2.12/work/driver-20200605072356-0020/s3a/app-20200605072420-0016/_temporary/0/_temporary/attempt_20200605072439_0000_m_000000_0}, conflictResolution=FAIL, wrappedCommitter=FileOutputCommitter{PathOutputCommitter{context=TaskAttemptContextImpl{JobContextImpl{jobId=job_20200605072439_0000}; taskId=attempt_20200605072439_0000_m_000000_0, status=''}; org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter@40c38a}; outputPath=hdfs://namenode/s3a-staging-tmp/www-data/app-20200605072420-0016/staging-uploads, workPath=null, algorithmVersion=1, skipCleanup=false, ignoreCleanupFailures=false}} for s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet
20/06/05 07:24:39 DEBUG StagingCommitter: Task committer attempt_20200605072439_0000_m_000000_0, Setting up job (no job ID)
20/06/05 07:24:39 INFO SparkContext: Starting job: save at build.clj:466
20/06/05 07:24:39 INFO DAGScheduler: Got job 4 (save at build.clj:466) with 1 output partitions
20/06/05 07:24:39 INFO DAGScheduler: Final stage: ResultStage 8 (save at build.clj:466)
20/06/05 07:24:39 INFO DAGScheduler: Parents of final stage: List()
20/06/05 07:24:39 INFO DAGScheduler: Missing parents: List()
20/06/05 07:24:39 INFO DAGScheduler: Submitting ResultStage 8 (CoalescedRDD[67] at save at build.clj:466), which has no missing parents
20/06/05 07:24:39 INFO SparkContext: Starting job: save at build.clj:466
20/06/05 07:24:39 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1161
20/06/05 07:24:39 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (CoalescedRDD[67] at save at build.clj:466) (first 15 tasks are for partitions Vector(0))
20/06/05 07:24:39 INFO TaskSchedulerImpl: Adding task set 8.0 with 1 tasks
20/06/05 07:24:39 INFO DAGScheduler: Got job 5 (save at build.clj:466) with 1 output partitions
20/06/05 07:24:39 INFO DAGScheduler: Final stage: ResultStage 9 (save at build.clj:466)
20/06/05 07:24:39 INFO DAGScheduler: Parents of final stage: List()
20/06/05 07:24:39 INFO DAGScheduler: Missing parents: List()
20/06/05 07:24:39 INFO DAGScheduler: Submitting ResultStage 9 (CoalescedRDD[63] at save at build.clj:466), which has no missing parents
20/06/05 07:24:39 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1161
20/06/05 07:24:39 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 9 (CoalescedRDD[63] at save at build.clj:466) (first 15 tasks are for partitions Vector(0))
20/06/05 07:24:39 INFO TaskSchedulerImpl: Adding task set 9.0 with 1 tasks
20/06/05 07:24:41 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
20/06/05 07:24:41 INFO DAGScheduler: ResultStage 8 (save at build.clj:466) finished in 1.624 s
20/06/05 07:24:41 INFO DAGScheduler: Job 4 finished: save at build.clj:466, took 1.628787 s
20/06/05 07:24:41 INFO AbstractS3ACommitter: Starting: Task committer attempt_20200605072439_0000_m_000000_0: commitJob((no job ID))
20/06/05 07:24:41 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
20/06/05 07:24:41 INFO DAGScheduler: ResultStage 9 (save at build.clj:466) finished in 1.602 s
20/06/05 07:24:41 INFO DAGScheduler: Job 5 finished: save at build.clj:466, took 1.635180 s
20/06/05 07:24:41 INFO AbstractS3ACommitter: Starting: Task committer attempt_20200605072439_0000_m_000000_0: commitJob((no job ID))
20/06/05 07:24:41 DEBUG AbstractS3ACommitter: Task committer attempt_20200605072439_0000_m_000000_0: creating thread pool of size 8
20/06/05 07:24:41 DEBUG AbstractS3ACommitter: Task committer attempt_20200605072439_0000_m_000000_0: creating thread pool of size 8
20/06/05 07:24:41 DEBUG Tasks: Executing task
20/06/05 07:24:41 DEBUG Tasks: Executing task
20/06/05 07:24:41 DEBUG Tasks: Waiting for 2 tasks to complete
20/06/05 07:24:41 DEBUG Tasks: Executing task
20/06/05 07:24:41 DEBUG Tasks: Waiting for 2 tasks to complete
20/06/05 07:24:41 DEBUG Tasks: Executing task
20/06/05 07:24:41 DEBUG PendingSet: Reading pending commits in file hdfs://namenode/s3a-staging-tmp/www-data/app-20200605072420-0016/staging-uploads/_temporary/0/task_20200605072439_0008_m_000000
20/06/05 07:24:41 DEBUG PendingSet: Reading pending commits in file hdfs://namenode/s3a-staging-tmp/www-data/app-20200605072420-0016/staging-uploads/_temporary/0/task_20200605072439_0009_m_000000
20/06/05 07:24:41 DEBUG PendingSet: Reading pending commits in file hdfs://namenode/s3a-staging-tmp/www-data/app-20200605072420-0016/staging-uploads/_temporary/0/task_20200605072439_0009_m_000000
20/06/05 07:24:41 DEBUG PendingSet: Reading pending commits in file hdfs://namenode/s3a-staging-tmp/www-data/app-20200605072420-0016/staging-uploads/_temporary/0/task_20200605072439_0008_m_000000
20/06/05 07:24:41 DEBUG Tasks: Task succeeded
20/06/05 07:24:41 DEBUG Tasks: Task succeeded
20/06/05 07:24:41 DEBUG Tasks: Task succeeded
20/06/05 07:24:41 DEBUG Tasks: Task succeeded
20/06/05 07:24:41 DEBUG Tasks: Finished count -> 2/2
20/06/05 07:24:41 DEBUG Tasks: Finished count -> 2/2
20/06/05 07:24:41 DEBUG AbstractS3ACommitter: Task committer attempt_20200605072439_0000_m_000000_0: committing the output of 2 task(s)
20/06/05 07:24:41 DEBUG AbstractS3ACommitter: Task committer attempt_20200605072439_0000_m_000000_0: committing the output of 2 task(s)
20/06/05 07:24:41 DEBUG Tasks: Executing task
20/06/05 07:24:41 DEBUG Tasks: Executing task
20/06/05 07:24:41 DEBUG CommitOperations: Committing single commit DelayedCompleteData{version=1, uri='s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet/part-00000-171fd21c-4d3c-4c41-a329-e8d1e9ffd108-c000-app-20200605072420-0016.snappy.parquet', destination='spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet/part-00000-171fd21c-4d3c-4c41-a329-e8d1e9ffd108-c000-app-20200605072420-0016.snappy.parquet', uploadId='SnSQDLeughVPLcrE1KaqPs5LsxCKirUhNpZPQXa5zqCAcYMfD6pLi6DaUVTawZWl9eii0kjxGht7ZG1JfyhqGOvBe99bE.7JH0oAZ5AzaKTggb.eecGL1vfccc1aV7Wg', created=1591341881205, saved=1591341881205, size=1069, date='Fri Jun 05 07:24:41 UTC 2020', jobId='', taskId='', notes='', etags=[00c2433b77ce079387d52ebb343e09c1]}
20/06/05 07:24:41 DEBUG CommitOperations: Committing single commit DelayedCompleteData{version=1, uri='s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc/part-00000-0e2def2a-b7fe-480f-bd3b-5b9cbd6decf8-c000-app-20200605072420-0016.zlib.orc', destination='spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc/part-00000-0e2def2a-b7fe-480f-bd3b-5b9cbd6decf8-c000-app-20200605072420-0016.zlib.orc', uploadId='7hquC9BXz_1Hni0aK3n_577vRxoGTHz5ulkKoQ7BxhZJpqnpXOimyBd3xhZSabs1bnkKHt_WocUtSHFVEb2fO_98lsdyN4P.hby4rCn38wMxJGZMTC8Gl6I8UT5ZpoUp', created=1591341880822, saved=1591341880822, size=597, date='Fri Jun 05 07:24:40 UTC 2020', jobId='', taskId='', notes='', etags=[fbefef72e70eb0a25734f835e6fac54e]}
20/06/05 07:24:41 DEBUG Tasks: Waiting for 2 tasks to complete
20/06/05 07:24:41 DEBUG Tasks: Executing task
20/06/05 07:24:41 DEBUG Tasks: Waiting for 2 tasks to complete
20/06/05 07:24:41 DEBUG CommitOperations: Committing single commit DelayedCompleteData{version=1, uri='s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc/part-00000-0e2def2a-b7fe-480f-bd3b-5b9cbd6decf8-c000-app-20200605072420-0016.zlib.orc', destination='spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc/part-00000-0e2def2a-b7fe-480f-bd3b-5b9cbd6decf8-c000-app-20200605072420-0016.zlib.orc', uploadId='7hquC9BXz_1Hni0aK3n_577vRxoGTHz5ulkKoQ7BxhZJpqnpXOimyBd3xhZSabs1bnkKHt_WocUtSHFVEb2fO_98lsdyN4P.hby4rCn38wMxJGZMTC8Gl6I8UT5ZpoUp', created=1591341880822, saved=1591341880822, size=597, date='Fri Jun 05 07:24:40 UTC 2020', jobId='', taskId='', notes='', etags=[fbefef72e70eb0a25734f835e6fac54e]}
20/06/05 07:24:41 DEBUG Tasks: Executing task
20/06/05 07:24:41 DEBUG CommitOperations: Committing single commit DelayedCompleteData{version=1, uri='s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet/part-00000-171fd21c-4d3c-4c41-a329-e8d1e9ffd108-c000-app-20200605072420-0016.snappy.parquet', destination='spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet/part-00000-171fd21c-4d3c-4c41-a329-e8d1e9ffd108-c000-app-20200605072420-0016.snappy.parquet', uploadId='SnSQDLeughVPLcrE1KaqPs5LsxCKirUhNpZPQXa5zqCAcYMfD6pLi6DaUVTawZWl9eii0kjxGht7ZG1JfyhqGOvBe99bE.7JH0oAZ5AzaKTggb.eecGL1vfccc1aV7Wg', created=1591341881205, saved=1591341881205, size=1069, date='Fri Jun 05 07:24:41 UTC 2020', jobId='', taskId='', notes='', etags=[00c2433b77ce079387d52ebb343e09c1]}
20/06/05 07:24:41 DEBUG CommitOperations: Successful commit of file length 1069
20/06/05 07:24:41 DEBUG Tasks: Task succeeded
20/06/05 07:24:41 DEBUG Tasks: Finished count -> 1/2
20/06/05 07:24:41 DEBUG CommitOperations: Successful commit of file length 1069
20/06/05 07:24:41 DEBUG Tasks: Task succeeded
20/06/05 07:24:41 DEBUG Tasks: Finished count -> 1/2
20/06/05 07:24:41 DEBUG CommitOperations: Successful commit of file length 597
20/06/05 07:24:41 DEBUG Tasks: Task succeeded
20/06/05 07:24:41 DEBUG Tasks: Finished count -> 2/2
20/06/05 07:24:41 DEBUG CommitOperations: Touching success marker for job s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc/_SUCCESS: SuccessData{committer='directory', hostname='ip-172-17-252-12', description='Task committer attempt_20200605072439_0000_m_000000_0', date='Fri Jun 05 07:24:41 UTC 2020', filenames=[/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc/part-00000-0e2def2a-b7fe-480f-bd3b-5b9cbd6decf8-c000-app-20200605072420-0016.zlib.orc, /spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet/part-00000-171fd21c-4d3c-4c41-a329-e8d1e9ffd108-c000-app-20200605072420-0016.snappy.parquet]}
20/06/05 07:24:41 DEBUG CommitOperations: Successful commit of file length 597
20/06/05 07:24:41 DEBUG Tasks: Task succeeded
20/06/05 07:24:41 DEBUG Tasks: Finished count -> 2/2
20/06/05 07:24:41 DEBUG CommitOperations: Touching success marker for job s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet/_SUCCESS: SuccessData{committer='directory', hostname='ip-172-17-252-12', description='Task committer attempt_20200605072439_0000_m_000000_0', date='Fri Jun 05 07:24:41 UTC 2020', filenames=[/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet/part-00000-171fd21c-4d3c-4c41-a329-e8d1e9ffd108-c000-app-20200605072420-0016.snappy.parquet, /spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc/part-00000-0e2def2a-b7fe-480f-bd3b-5b9cbd6decf8-c000-app-20200605072420-0016.zlib.orc]}
20/06/05 07:24:42 INFO AbstractS3ACommitter: Starting: Cleanup job (no job ID)
20/06/05 07:24:42 INFO AbstractS3ACommitter: Starting: Aborting all pending commits under s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc
20/06/05 07:24:42 DEBUG Tasks: Waiting for 0 tasks to complete
20/06/05 07:24:42 INFO AbstractS3ACommitter: Aborting all pending commits under s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/other_namespace/test_a.orc: duration 0:00.019s
20/06/05 07:24:42 INFO AbstractS3ACommitter: Cleanup job (no job ID): duration 0:00.019s
20/06/05 07:24:42 DEBUG StagingCommitter: Cleaning up work path file:/opt/spark-2.4.4-without-hadoop-scala-2.12/work/driver-20200605072356-0020/s3a/app-20200605072420-0016/_temporary/0/_temporary/attempt_20200605072439_0000_m_000000_0
20/06/05 07:24:42 INFO AbstractS3ACommitter: Task committer attempt_20200605072439_0000_m_000000_0: commitJob((no job ID)): duration 0:00.789s
20/06/05 07:24:42 INFO AbstractS3ACommitter: Starting: Cleanup job (no job ID)
20/06/05 07:24:42 INFO AbstractS3ACommitter: Starting: Aborting all pending commits under s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet
20/06/05 07:24:42 DEBUG Tasks: Waiting for 0 tasks to complete
20/06/05 07:24:42 INFO AbstractS3ACommitter: Aborting all pending commits under s3://bucket/spark/tables/planex/table-20200605-26635-TbaguBgC8ef/test_b.parquet: duration 0:00.016s
20/06/05 07:24:42 INFO AbstractS3ACommitter: Cleanup job (no job ID): duration 0:00.017s
20/06/05 07:24:42 DEBUG StagingCommitter: Cleaning up work path file:/opt/spark-2.4.4-without-hadoop-scala-2.12/work/driver-20200605072356-0020/s3a/app-20200605072420-0016/_temporary/0/_temporary/attempt_20200605072439_0000_m_000000_0
20/06/05 07:24:42 INFO AbstractS3ACommitter: Task committer attempt_20200605072439_0000_m_000000_0: commitJob((no job ID)): duration 0:00.800s
20/06/05 07:24:42 INFO FileFormatWriter: Write Job 9c462cc6-ffcb-4739-8831-62d880205100 committed.
20/06/05 07:24:42 INFO FileFormatWriter: Finished processing stats for write job 9c462cc6-ffcb-4739-8831-62d880205100.
20/06/05 07:24:42 INFO FileFormatWriter: Write Job f90461b6-45a4-4112-8e1c-bee869c37bd0 committed.
20/06/05 07:24:42 INFO FileFormatWriter: Finished processing stats for write job f90461b6-45a4-4112-8e1c-bee869c37bd0.
{noformat}
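For reference, here is a minimal sketch of the kind of write pattern that produces a log like the one above: two saves to different destinations started concurrently from one driver. This is a reconstruction, not the original code — the real driver is Clojure (build.clj:466), and the object name, bucket, paths, and data below are placeholders. It assumes the cluster is already configured for the S3A "directory" staging committer with conflict mode FAIL, as the log shows:
{code:scala}
// Hypothetical reconstruction; names, paths, and data are placeholders.
import org.apache.spark.sql.SparkSession

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object ConcurrentStagingWrites {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("concurrent-staging-writes")
      .getOrCreate()
    import spark.implicits._

    // Two independent save jobs kicked off from the same driver
    // within the same second.
    val orcJob = Future {
      Seq(("a", 1)).toDF("k", "v").coalesce(1)
        .write.mode("error")
        .orc("s3a://bucket/spark/tables/planex/other_namespace/test_a.orc")
    }
    val parquetJob = Future {
      Seq(("b", 2)).toDF("k", "v").coalesce(1)
        .write.mode("error")
        .parquet("s3a://bucket/spark/tables/planex/test_b.parquet")
    }

    Await.result(orcJob, Duration.Inf)
    Await.result(parquetJob, Duration.Inf)
    spark.stop()
  }
}
{code}
Because the Hadoop job ID is derived from a second-granularity timestamp, both jobs in the log above end up as job_20200605072439_0000, so both staging committers point their wrappedCommitter at the same staging-uploads directory, and at commitJob each one reads both pendingset files and materializes (and lists in its _SUCCESS marker) the other job's upload as well as its own.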
> S3A staging committer committing duplicate files
> ------------------------------------------------
>
> Key: HADOOP-17066
> URL: https://issues.apache.org/jira/browse/HADOOP-17066
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.3.0, 3.2.1, 3.1.3
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
>
> SPARK-39111 reports concurrent jobs double-writing files.