turboFei commented on pull request #26339:
URL: https://github.com/apache/spark/pull/26339#issuecomment-653318727


   > > > > I will close this and create a new PR with a new solution. Thanks.
   > > > 
   > > > 
   > > > why close this? did you find a better approach?
   > > 
   > > 
   > > Hi, here is the new patch.
   > > In the new solution, I define a new OutputCommitter.
   > > I am still working on it.
   > > #28989
   > 
   > Thank you. Curious why you changed direction... is there anything wrong 
with the approach in this pull request? We were just about to start testing it 
at scale, that's why I ask.
   > best
   
   In the original solution, when renaming a staging task file to its final 
file, we check whether the final file already exists and use that to decide 
whether to rename the staging task file.
   
   The tricky part is that the final files may come from different tasks.
   
   If a task's output for a partition consists of multiple files (or in the 
bucketed table insert case), the data might be corrupted.
   
   So we need the outputCommitCoordinator to help decide which task can commit.
   
   In the new solution, we define a new output committer that leverages the 
outputCommitCoordinator (by invoking SparkHadoopMapRedUtil.commitTask).
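
To make the failure mode concrete, here is a minimal toy model in Python. It is not Spark code. It assumes that two attempts of the same task write the same file names for a partition but split the rows differently, and that their renames interleave. `rename_if_absent`, `ToyCommitCoordinator`, `commit_task` and the file names are hypothetical names made up for this sketch; the coordinator is only a rough stand-in for the idea of authorizing a single attempt to commit.

```python
from typing import Dict, List, Optional, Tuple

# Toy model of a final output directory: file name -> (attempt id, rows).
FinalDir = Dict[str, Tuple[str, List[int]]]


def rename_if_absent(final_dir: FinalDir, attempt: str, name: str, rows: List[int]) -> bool:
    """Per-file check: move a staging file only if the final file is missing."""
    if name in final_dir:
        return False  # another attempt already produced this file name
    final_dir[name] = (attempt, rows)
    return True


class ToyCommitCoordinator:
    """Hypothetical coordinator: the first attempt that asks is the only one allowed."""

    def __init__(self) -> None:
        self.authorized: Optional[str] = None

    def can_commit(self, attempt: str) -> bool:
        if self.authorized is None:
            self.authorized = attempt
        return self.authorized == attempt


def commit_task(coordinator: ToyCommitCoordinator, final_dir: FinalDir,
                attempt: str, files: Dict[str, List[int]]) -> bool:
    """Commit all files of one attempt, but only if the coordinator allows it."""
    if not coordinator.can_commit(attempt):
        return False
    for name, rows in files.items():
        final_dir[name] = (attempt, rows)
    return True


# Assumption: both attempts hold rows 1..4 for the same partition, split differently.
attempt_a = {"part-0.c000": [1, 2], "part-0.c001": [3, 4]}
attempt_b = {"part-0.c000": [1], "part-0.c001": [2, 3, 4]}

# Per-file existence check with interleaved renames from both attempts.
racy: FinalDir = {}
rename_if_absent(racy, "A", "part-0.c000", attempt_a["part-0.c000"])
rename_if_absent(racy, "B", "part-0.c001", attempt_b["part-0.c001"])
rename_if_absent(racy, "A", "part-0.c001", attempt_a["part-0.c001"])  # skipped
rename_if_absent(racy, "B", "part-0.c000", attempt_b["part-0.c000"])  # skipped
racy_rows = sorted(r for _, rows in racy.values() for r in rows)

# Coordinated commit: exactly one attempt publishes all of its files.
coordinator = ToyCommitCoordinator()
safe: FinalDir = {}
a_committed = commit_task(coordinator, safe, "A", attempt_a)
b_committed = commit_task(coordinator, safe, "B", attempt_b)
safe_rows = sorted(r for _, rows in safe.values() for r in rows)

print("per-file check:", racy_rows)
print("coordinated   :", safe_rows)
```

In this sketch the per-file check leaves a final directory that mixes files from both attempts, so a row shows up twice, while the coordinated path publishes the files of a single attempt only.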
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
