turboFei commented on pull request #26339: URL: https://github.com/apache/spark/pull/26339#issuecomment-653318727
> > > > close this and will create a new PR with a new solution. thanks
> > >
> > > why close this? did you find a better approach?
> >
> > Hi, here is the new patch.
> > In the new solution, I define a new OutputCommitter.
> > I am still working on it.
> > #28989
>
> thank you. curious why you changed direction... is there anything wrong with the approach in this pullreq? we were just about to start testing it at scale, that's why i ask.
> best

In the original solution, when renaming a staging task file to its final file, we check whether the final file already exists and then decide whether to rename the staging task file. This is tricky because the final files may come from different tasks. If the task output for a partition consists of multiple files (or in the bucket table insert case), the data might be corrupted.

So we need outputCommitCoordinator to help decide which task can commit. In the new solution, we define a new output committer that leverages outputCommitCoordinator (by invoking SparkHadoopMapRedUtil.commitTask).
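
To make the difference concrete, here is a minimal toy sketch in Python. It is not Spark code: a dict stands in for the final partition directory, and `rename_if_absent`, `ToyCommitCoordinator`, and `commit_if_authorized` are hypothetical names invented for illustration. It also assumes one particular failure scenario, namely that two attempts of the same task split the same rows differently across their output files; the real coordinator additionally handles failures and retries, which this sketch ignores.

```python
# Toy model only. None of these names are Spark or Hadoop APIs.

def rename_if_absent(final_dir, name, data):
    """Per-file check-then-rename: skip if the final file already exists."""
    if name in final_dir:
        return False
    final_dir[name] = data
    return True


class ToyCommitCoordinator:
    """Authorizes at most one attempt per partition (first asker wins)."""

    def __init__(self):
        self._winner = {}

    def can_commit(self, partition, attempt):
        # setdefault records the first attempt that asks and returns the winner.
        return self._winner.setdefault(partition, attempt) == attempt


def commit_if_authorized(final_dir, staged, coordinator, partition, attempt):
    """Move all staged files of an attempt, but only if the coordinator agrees."""
    if coordinator.can_commit(partition, attempt):
        final_dir.update(staged)
        return True
    return False


# Assumed scenario: two attempts write the same rows (1, 2, 3),
# but split them differently across two files.
attempt0 = {"part-0.c000": [1, 2], "part-0.c001": [3]}
attempt1 = {"part-0.c000": [1], "part-0.c001": [2, 3]}

# Per-file existence check with interleaved renames.
mixed = {}
rename_if_absent(mixed, "part-0.c000", attempt0["part-0.c000"])  # attempt 0 wins
rename_if_absent(mixed, "part-0.c001", attempt1["part-0.c001"])  # attempt 1 wins
rename_if_absent(mixed, "part-0.c000", attempt1["part-0.c000"])  # skipped
rename_if_absent(mixed, "part-0.c001", attempt0["part-0.c001"])  # skipped
mixed_rows = sorted(r for rows in mixed.values() for r in rows)

# Coordinator-gated commit: only one attempt's files reach the final directory.
coordinator = ToyCommitCoordinator()
clean = {}
first = commit_if_authorized(clean, attempt1, coordinator, partition=0, attempt=1)
second = commit_if_authorized(clean, attempt0, coordinator, partition=0, attempt=0)
clean_rows = sorted(r for rows in clean.values() for r in rows)

print(mixed_rows)  # row 2 appears twice: files came from different attempts
print(clean_rows)  # each row appears once
```

In the first half, every individual rename looks safe, yet the final directory mixes files from two attempts. In the second half, the decision is made once per task attempt rather than once per file, which is the role the comment above assigns to outputCommitCoordinator.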
