[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

tgravescs Mon, 13 Aug 2018 09:26:00 -0700

Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    if you are looking at recomputing how are you going to handle if some tasks 
have already written output?  This was brought up by @cloud-fan above and I 
didn't see a response.  Some output formats have a task commit and then a job 
commit so it may work for those, but others might not have that.
    
    > Unfortunately we are not able to deliver them on 2.4, but I'm optimistic 
we may include them in 3.0 and of course backport them to all the active 
branches.
    
    I really disagree with this.  We need to fix this in some way before 2.4 
release.  If the sort way is a fix but performance regression we should do that 
as its at least fixed by default.  We have the config for people who are ok 
with possible corruption and just want the performance.  I wouldn't think its 
any worse then what is there for dataframes based on what you have said.
    I totally understand if you won't have time to work on this but perhaps 
others will. 
    I plan on working on this now but would be nice for us to come up with a 
overall approach.
    
    did anyone run benchmarks on the fix for dataframes?   I'm really curious 
what the real performance implications are.  
    
    Note that Apache PIG also had a similar issue withe the round robin 
partitioner and they removed it and used a hash value partitioner.   Spark is 
obviously different but the underlying issue is the same.  I would actually 
prefer to see us just use the hash partitioner if we can't find a better 
solution.  Our official docs I don't think says it repartitions evenly 
(http://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.rdd.RDD),
 but our programming guide does:
    http://spark.apache.org/docs/latest/rdd-programming-guide.html
    "
    Reshuffle the data in the RDD randomly to create either more or fewer 
partitions and balance it across them. This always shuffles all data over the 
network.
    --
    "
    
    So I could see the argument that says we can't change that behavior.




---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

Reply via email to