Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
> The FileCommitProtocol is an internal API, and our current implementation
> does store task-level data temporarily in a staging directory (see
> HadoopMapReduceCommitProtocol). That said, we can fix the FileCommitProtocol
> so it can roll back a committed task, as long as the job is not committed.
>
> As an example, we can change the semantics of FileCommitProtocol.commitTask
> and say that it may be called multiple times for the same task, with the
> later call winning, and change HadoopMapReduceCommitProtocol to support that.
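The "commitTask may be called multiple times, later wins" semantics from the quoted proposal can be sketched roughly as follows. This is a hypothetical illustration, not Spark's internal FileCommitProtocol API; the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "commitTask may be called multiple times, and
// the later call wins". Re-committing a task id replaces the earlier
// output, which is only safe before the job itself is committed.
public class LastWinsCommitProtocol {
    private final Map<Integer, String> committed = new HashMap<>();
    private boolean jobCommitted = false;

    public void commitTask(int taskId, String output) {
        if (jobCommitted) {
            throw new IllegalStateException("job already committed");
        }
        committed.put(taskId, output); // later commit overwrites the earlier one
    }

    public Map<Integer, String> commitJob() {
        jobCommitted = true;
        return new HashMap<>(committed);
    }

    public static void main(String[] args) {
        LastWinsCommitProtocol p = new LastWinsCommitProtocol();
        p.commitTask(0, "attempt-1-output");
        p.commitTask(0, "attempt-2-output"); // re-commit: the later call wins
        System.out.println(p.commitJob().get(0)); // prints attempt-2-output
    }
}
```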
Hmm, I wasn't aware that option was added. It looks like it only applies if
you use dynamicPartitionOverwrite, which appears to target very specific
output committers for SQL. I don't see how that works with all output
committers: I can write an output committer that never writes anything to
disk; it might write to a DB, HBase, or any custom sink, and these
staging-directory moves don't make sense in that context.
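To make that concern concrete, here is a hypothetical committer that publishes rows to an external store rather than moving files. Nothing below is Spark or Hadoop API; it is only meant to show that once a task commit is a flush to an external system, there is no staging directory or file rename for a rollback to operate on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical committer that "commits" by flushing buffered rows to an
// external store (a stand-in for a DB or HBase client). There is no
// staging directory and no file move, so rename-based rollback of a
// committed task has nothing to undo.
public class ExternalStoreCommitter {
    private final List<String> buffered = new ArrayList<>();
    private final List<String> store; // stand-in for the external system

    public ExternalStoreCommitter(List<String> store) {
        this.store = store;
    }

    public void write(String row) {
        buffered.add(row); // task-side buffering only; nothing visible yet
    }

    public void commitTask() {
        store.addAll(buffered); // once flushed, rows are externally visible
        buffered.clear();
    }
}
```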
Note that it also has a performance impact; that is why the MapReduce output
committers stopped doing this, see the v2 algorithm selected via
mapreduce.fileoutputcommitter.algorithm.version
(https://issues.apache.org/jira/browse/SPARK-20107). I realize performance
may need to be sacrificed in certain cases, but I wanted to point this out at
least. My main concern is really the statement above: I don't see how this
works for all output committers.
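For reference, the commit-algorithm version mentioned above is a Hadoop configuration property; a minimal sketch of selecting the v2 algorithm (the property name comes from the discussion, the XML shape is the standard Hadoop site-config form):

```xml
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version</name>
  <value>2</value>
</property>
```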