Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21066#discussion_r186466259
  
    --- Diff: hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/BindingParquetOutputCommitter.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.internal.io.cloud
    +
    +import java.io.IOException
    +
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.mapreduce.lib.output.{BindingPathOutputCommitter, PathOutputCommitter}
    +import org.apache.hadoop.mapreduce.{JobContext, JobStatus, TaskAttemptContext}
    +import org.apache.parquet.hadoop.ParquetOutputCommitter
    +
    +import org.apache.spark.internal.Logging
    +
    +
    +/**
    + * This dynamically binds to the factory-configured
    + * output committer, and is intended to allow callers to use any [[PathOutputCommitter]],
    + * even if not a subclass of [[ParquetOutputCommitter]].
    + *
    + * The Parquet "parquet.enable.summary-metadata" option will only be supported
    + * if the instantiated committer itself supports it.
    + */
    +
    +class BindingParquetOutputCommitter(
    +    path: Path,
    +    context: TaskAttemptContext)
    +  extends ParquetOutputCommitter(path, context) with Logging {
    +
    +  logInfo(s"${this.getClass.getName} binding to configured PathOutputCommitter and dest $path")
    +
    +  val committer = new BindingPathOutputCommitter(path, context)
    +
    +  /**
    +   * This is the committer ultimately bound to.
    +   * @return the committer instantiated by the factory.
    +   */
    +  def boundCommitter(): PathOutputCommitter = {
    +    committer.getCommitter()
    +  }
    +
    +  override def getWorkPath: Path = {
    +    committer.getWorkPath()
    +  }
    +
    +  override def setupTask(taskAttemptContext: TaskAttemptContext): Unit = {
    +    committer.setupTask(taskAttemptContext)
    +  }
    +
    +  override def commitTask(taskAttemptContext: TaskAttemptContext): Unit = {
    +    committer.commitTask(taskAttemptContext)
    +  }
    +
    +  override def abortTask(taskAttemptContext: TaskAttemptContext): Unit = {
    +    committer.abortTask(taskAttemptContext)
    +  }
    +
    +  override def setupJob(jobContext: JobContext): Unit = {
    +    committer.setupJob(jobContext)
    +  }
    +
    +  override def needsTaskCommit(taskAttemptContext: TaskAttemptContext): Boolean = {
    +    committer.needsTaskCommit(taskAttemptContext)
    +  }
    +
    +  override def cleanupJob(jobContext: JobContext): Unit = {
    +    committer.cleanupJob(jobContext)
    +  }
    +
    +  override def isCommitJobRepeatable(jobContext: JobContext): Boolean = {
    +    committer.isCommitJobRepeatable(jobContext)
    +  }
    +
    +  override def commitJob(jobContext: JobContext): Unit = {
    +    committer.commitJob(jobContext)
    +  }
    +
    +  override def recoverTask(taskAttemptContext: TaskAttemptContext): Unit = {
    +    committer.recoverTask(taskAttemptContext)
    +  }
    +
    +  /**
    +   * Abort the job; log and ignore any IO exception thrown.
    +   *
    +   * @param jobContext job context
    +   * @param state final state of the job
    +   */
    +  override def abortJob(
    +      jobContext: JobContext,
    +      state: JobStatus.State): Unit = {
    +    try {
    +      committer.abortJob(jobContext, state)
    +    } catch {
    +      case e: IOException =>
    --- End diff --
    
    That's exactly the question @mridulm asked, which is why the next commit to this PR will cover it in comments. Essentially: this abort operation is regularly used in exception handling code, and that code tends to assume that the abort() routine doesn't fail. If abort() does fail, its exception can get rethrown and so hide the underlying failure which triggered the abort.
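    
    To make that concrete, here is a minimal caller-side sketch (not code from this PR; `commitOrAbort` and its arguments are hypothetical) of the pattern that relies on `abortJob()` not throwing:
    
    ```scala
    import org.apache.hadoop.mapreduce.{JobContext, JobStatus}
    import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
    
    object CommitOrAbort {
      // Hypothetical caller-side pattern: commit the job, and on failure try to
      // abort before rethrowing the *original* exception.
      def commitOrAbort(committer: PathOutputCommitter, jobContext: JobContext): Unit = {
        try {
          committer.commitJob(jobContext)
        } catch {
          case cause: Exception =>
            // If abortJob() itself threw here, that new exception would propagate
            // instead of `cause`, hiding the real reason the job failed. That is
            // why the committer's abortJob() swallows and logs IOExceptions.
            committer.abortJob(jobContext, JobStatus.State.FAILED)
            throw cause
        }
      }
    }
    ```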
    
    There's an underlying question lurking here: what do you do when the abort operation itself fails?
    For HDFS &c, with the filesystem used as the destination, `rm -rf $dest` does that cleanup. For S3, uncommitted uploads still incur charges, so you need to define a maximum lifespan for outstanding uploads, e.g. 24h, and/or run the new Hadoop CLI calls to list/abort MPUs under a path. For a database, well, it's up to the DB I guess.
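    
    For illustration only: the Hadoop CLI does this for you, but a rough sketch of what "list/abort MPUs under a path" amounts to, written directly against the AWS SDK for Java v1 (bucket and prefix are placeholders; pagination of the listing is ignored):
    
    ```scala
    import scala.collection.JavaConverters._
    
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model.{AbortMultipartUploadRequest, ListMultipartUploadsRequest}
    
    object AbortPendingUploads {
      def main(args: Array[String]): Unit = {
        val bucket = "example-bucket"   // placeholder
        val prefix = "output/path/"     // placeholder destination prefix
        val s3 = AmazonS3ClientBuilder.defaultClient()
        // List outstanding multipart uploads under the prefix and abort each one,
        // so they stop accruing storage charges.
        val listing = s3.listMultipartUploads(
          new ListMultipartUploadsRequest(bucket).withPrefix(prefix))
        listing.getMultipartUploads.asScala.foreach { upload =>
          s3.abortMultipartUpload(
            new AbortMultipartUploadRequest(bucket, upload.getKey, upload.getUploadId))
        }
      }
    }
    ```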
    
    What is important, though it's never explicitly called out, is: *the uncommitted work of a previous job attempt must never form part of the final output of a successor*. The error handling here doesn't do anything to help or hinder that.


---
