Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186915569
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check whether to delete the whole dir or just its sub-files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if this is the table root, and record the files to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    The key point is that the data is already in the `output` directory before the 
    job is committed, so we can't delete the `output` directory anymore.
    
    We override `FileCommitProtocol`'s `deleteWithJob` method in 
    `HadoopMapReduceCommitProtocol`. It no longer deletes the files immediately; 
    instead it defers the deletion until the entire job is committed.
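
    In other words, a minimal sketch of that deferral could look like this 
    (assuming the Spark 2.x `FileCommitProtocol` API, i.e. `deleteWithJob` and 
    `commitJob`; the class name, the buffer field, and the exact ordering inside 
    `commitJob` are illustrative, not the code in this PR):

        import scala.collection.mutable

        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.hadoop.mapreduce.JobContext

        import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
        import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

        // Hypothetical subclass: records the paths instead of deleting them eagerly.
        class DeferredDeleteCommitProtocol(jobId: String, path: String)
          extends HadoopMapReduceCommitProtocol(jobId, path) {

          // (path, recursive) pairs registered while the job is being set up
          private val deferredDeletes = mutable.ArrayBuffer[(Path, Boolean)]()

          override def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
            // Do not touch the file system yet; just remember what to delete.
            deferredDeletes += ((path, recursive))
            true
          }

          override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
            // Clear the old data only once every task has finished successfully,
            // then let the parent commit move the new output into place.
            val fs = new Path(path).getFileSystem(jobContext.getConfiguration)
            deferredDeletes.foreach { case (p, recursive) => fs.delete(p, recursive) }
            super.commitJob(jobContext, taskCommits)
          }
        }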
    
    We do delete the files when the job is committed, but the temporary output 
    files are generated as soon as the tasks start, and those temporary files live 
    inside the `output` directory. Only at commit time is the data moved out of the 
    temporary location into the `output` directory.
    
    After the job starts, there is no safe time to delete the entire `output` 
directory.
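
    To make that concrete, here is a small sketch (the paths are hypothetical and 
    follow the default `FileOutputCommitter` layout, which this PR does not change): 
    the staging area for in-flight task output lives inside the destination 
    directory itself.

        import org.apache.hadoop.fs.Path

        // Hypothetical table location; plays the role of `qualifiedOutputPath` above.
        val output = new Path("/warehouse/table")

        // With the default FileOutputCommitter, a running task attempt writes its
        // temporary files under a _temporary subtree of the destination directory:
        val taskTemp = new Path(output,
          "_temporary/0/_temporary/attempt_20180509_0001_m_000000_0/part-00000.parquet")

        // Deleting `output` recursively after the tasks have started would therefore
        // also delete `taskTemp`, i.e. the very data that is about to be committed.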


---
