koertkuipers commented on a change in pull request #26971: [SPARK-30320][SQL]
Fix insert overwrite to DataSource table with dynamic partition error
URL: https://github.com/apache/spark/pull/26971#discussion_r404355749
##########
File path:
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -183,22 +187,30 @@ class HadoopMapReduceCommitProtocol(
       }
       if (dynamicPartitionOverwrite) {
-        val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
-        logDebug(s"Clean up default partition directories for overwriting: $partitionPaths")
-        for (part <- partitionPaths) {
-          val finalPartPath = new Path(path, part)
-          if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
-            // According to the official hadoop FileSystem API spec, delete op should assume
-            // the destination is no longer present regardless of return value, thus we do not
-            // need to double check if finalPartPath exists before rename.
-            // Also in our case, based on the spec, delete returns false only when finalPartPath
-            // does not exist. When this happens, we need to take action if parent of finalPartPath
-            // also does not exist(e.g. the scenario described on SPARK-23815), because
-            // FileSystem API spec on rename op says the rename dest(finalPartPath) must have
-            // a parent that exists, otherwise we may get unexpected result on the rename.
-            fs.mkdirs(finalPartPath.getParent)
-          }
-          fs.rename(new Path(stagingDir, part), finalPartPath)
+        val allPartitionPaths = partitionPathsAttemptIDPair.map {
+          case (allPartitionPath, successAttemptID) =>
+            allPartitionPath.foreach(part => {
+              val finalPartPath = new Path(path, part)
+              if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
+                // According to the official hadoop FileSystem API spec, delete op should assume
+                // the destination is no longer present regardless of return value, thus we do not
+                // need to double check if finalPartPath exists before rename.
+                // Also in our case, based on the spec, delete returns false only when finalPartPath
+                // does not exist. When this happens, we need to take action if parent of
+                // finalPartPath also does not exist(e.g. the scenario described on SPARK-23815),
+                // because FileSystem API spec on rename op says the rename dest(finalPartPath)
+                // must have a parent that exists, otherwise we may get unexpected result
+                // on the rename.
+                fs.mkdirs(finalPartPath.getParent)
+              }
+              fs.rename(new Path(s"$stagingDir/$successAttemptID", part), finalPartPath)
Review comment:
Do I understand correctly that `part` here is a directory (e.g. x=1/y=2),
not a file? So a whole directory full of files is being moved.
If so, couldn't multiple tasks write to the same partition? And wouldn't
these moves then conflict with each other?
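
To make the concern concrete, here is a minimal local-filesystem sketch of what I think could happen. It is not the PR's code: it uses java.io.File instead of the Hadoop FileSystem API, and the names PartitionMoveSketch, commit, demo, attempt_0 and attempt_1 are made up for illustration. It assumes two successful attempts each staged a file under the same partition directory, and that the commit loop deletes the final partition directory and then renames the staged directory into place, once per (partition, attempt) pair.

```scala
import java.io.File
import java.nio.file.Files

object PartitionMoveSketch {
  // Stand-in for fs.delete(path, recursive = true); a no-op if the path is missing.
  def deleteRecursively(f: File): Unit = {
    Option(f.listFiles).foreach(_.foreach(deleteRecursively))
    f.delete()
  }

  // Hypothetical flattening of the loop in the diff: for every
  // (partition, attemptId) pair, delete the final partition directory,
  // then rename the staged directory for that attempt into its place.
  def commit(staging: File, out: File, pairs: Seq[(String, String)]): Unit =
    pairs.foreach { case (part, attemptId) =>
      val finalPart = new File(out, part)
      deleteRecursively(finalPart)
      finalPart.getParentFile.mkdirs()
      new File(new File(staging, attemptId), part).renameTo(finalPart)
    }

  // Stages one file per attempt under the same partition, commits both,
  // and returns the file names left in the final partition directory.
  def demo(): Seq[String] = {
    val root = Files.createTempDirectory("sketch").toFile
    val staging = new File(root, ".staging")
    val out = new File(root, "out")
    val part = "x=1/y=2"
    for (attempt <- Seq("attempt_0", "attempt_1")) {
      val dir = new File(new File(staging, attempt), part)
      dir.mkdirs()
      Files.write(new File(dir, s"part-$attempt.parquet").toPath, Array[Byte]())
    }
    commit(staging, out, Seq(part -> "attempt_0", part -> "attempt_1"))
    new File(out, part).list().toSeq.sorted
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

In this sketch only the file from the attempt committed last survives, because the second pass deletes the directory the first pass just moved into place. Whether the real commit path can hit this depends on whether two attempt IDs can ever carry the same partition path, which is what I am asking about.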