Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20824#discussion_r175050086
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
    @@ -145,15 +146,23 @@ object FileCommitProtocol {
           jobId: String,
           outputPath: String,
           dynamicPartitionOverwrite: Boolean = false): FileCommitProtocol = {
    +
    +    logDebug(s"Creating committer $className; job $jobId; output=$outputPath;" +
    +      s" dynamic=$dynamicPartitionOverwrite")
         val clazz = Utils.classForName(className).asInstanceOf[Class[FileCommitProtocol]]
         // First try the constructor with arguments (jobId: String, outputPath: String,
         // dynamicPartitionOverwrite: Boolean).
         // If that doesn't exist, try the one with (jobId: string, outputPath: String).
         try {
           val ctor = clazz.getDeclaredConstructor(classOf[String], classOf[String], classOf[Boolean])
    +      logDebug("Using (String, String, Boolean) constructor")
           ctor.newInstance(jobId, outputPath, dynamicPartitionOverwrite.asInstanceOf[java.lang.Boolean])
         } catch {
           case _: NoSuchMethodException =>
    +        logDebug("Falling back to (String, String) constructor")
    +        require(!dynamicPartitionOverwrite,
    +          "Dynamic Partition Overwrite is enabled but" +
    +            s" the committer ${className} does not have the appropriate constructor")
    --- End diff ---
    
    something like that would work, though it'd be a bit more convoluted... InsertIntoFSRelation would have to check, and then handle the situation of missing support.
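
    To illustrate: such a probe for the 3-arg constructor could look something like the sketch below. The helper name `supportsDynamicOverwrite` and its placement are mine, purely hypothetical; only the constructor signature comes from the patch.

        // Hypothetical helper, not part of the patch: probe for the
        // (String, String, Boolean) constructor up front, so the insert
        // command can decide how to handle missing support *before*
        // instantiating the committer.
        import org.apache.spark.util.Utils

        def supportsDynamicOverwrite(className: String): Boolean = {
          val clazz = Utils.classForName(className)
          try {
            // the same signature instantiate() tries first
            clazz.getDeclaredConstructor(
              classOf[String], classOf[String], classOf[Boolean])
            true
          } catch {
            case _: NoSuchMethodException => false
          }
        }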
    
    One thing to consider in any form is: all implementations of FileCommitProtocol should be aware of the new Dynamic Partition Overwrite feature... adding a new 3-arg constructor is an implicit way of saying "I understand this". Where it's weak is that there's no way for a committer to say "I understand this and will handle it myself". That's essentially what's being done in the [Netflix Partitioned committer](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/PartitionedStagingCommitter.java#L142), which purges all partitions for which the new job has data. With that committer, if the insert asks for the feature, then the FileCommitProtocol binding to it could (somehow) turn this on and so handle everything internally.
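
    A strawman of what that "I'll handle it myself" declaration could be (nothing like this trait exists today; the names are invented for illustration):

        // Strawman only: a trait a committer could implement to declare
        // that it handles dynamic partition overwrite internally, so the
        // caller skips its own staging/overwrite dance.
        trait SelfManagedDynamicOverwrite {
          /** Invoked when the insert requests dynamic partition overwrite;
           *  the committer takes over conflict/purge handling itself. */
          def enableDynamicPartitionOverwrite(): Unit
        }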
    
    Like I said, a more complex model. It'd need changes a fair way through things, plus the usual complexity of getting the commit logic right.

