Robbie Russo created SPARK-10006:
------------------------------------

             Summary: Locality broken in spark 1.4.x for NewHadoopRDD
                 Key: SPARK-10006
                 URL: https://issues.apache.org/jira/browse/SPARK-10006
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.4.1, 1.4.0
            Reporter: Robbie Russo


After upgrading to spark 1.4.x, locality appears to be entirely broken for 
NewHadoopRDD on a spark cluster that is co-located with an HDFS cluster. 
An identical job on spark 1.2.x or 1.3.x would run every partition for us at 
locality level NODE_LOCAL; after upgrading to 1.4.x, the locality level 
switched to ANY for all partitions.
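
For reference, the reads in question go through the new Hadoop API, roughly 
like this (a minimal sketch, not our actual job; the path and input format 
are placeholders, and sc is the spark-shell context):

{code:scala}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newAPIHadoopFile builds a NewHadoopRDD under the hood. On 1.3.1 the web UI
// shows these tasks as NODE_LOCAL; on 1.4.1 every one of them shows up as ANY.
val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
  "hdfs:///data/input")  // placeholder path
println(rdd.map(_._2.toString).count())
{code}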

Furthermore, it appears to be launching the tasks in order of their locations, 
or something to that effect, because during the read there are hotspots of one 
node at a time with completely maxed resources. To test this theory I wrote a 
job that scans for all the files in the driver, parallelizes the list, and 
then loads the files back through the hadoop API in a mapPartitions function 
(which, correct me if I'm wrong, should be identical to using ANY locality?), 
and the result was that my hack was 4x faster than letting spark parse the 
files itself!
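
Roughly, that workaround job looks like the following (a sketch only: the 
directory is hypothetical, it assumes a single flat directory of line-oriented 
text files, and error handling / stream cleanup are elided):

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Enumerate the input files on the driver (hypothetical directory).
val dir = "hdfs:///data/input"
val fs = FileSystem.get(new URI(dir), new Configuration())
val files = fs.listStatus(new Path(dir)).map(_.getPath.toString)

// Hand the file names to the executors and read each file directly through
// the hadoop FileSystem API, bypassing Spark's input-split/locality handling
// entirely -- i.e. every task is effectively ANY.
val lines = sc.parallelize(files, files.length).mapPartitions { paths =>
  val conf = new Configuration()
  paths.flatMap { p =>
    val in = FileSystem.get(new URI(p), conf).open(new Path(p))
    Source.fromInputStream(in).getLines()  // streams left to GC in this sketch
  }
}
println(lines.count())
{code}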

As for the performance effect, this has caused a 12x slowdown for us from 
1.3.1 to 1.4.1. Needless to say, we have downgraded for now and everything 
works normally again.

We were able to reproduce this behavior on multiple clusters and on both 
hadoop 2.4 and hadoop 2.6 (I saw that there are 2 different code paths for 
figuring out preferred locations, depending on the presence of hadoop 2.6). 
The only thing that has fixed the problem for us is downgrading back to 1.3.1.


Not sure how helpful it will be, but through reflection I checked the result 
of calling the getPreferredLocations method on the RDD, and it returned an 
empty List on both 1.3.1, where locality works, and 1.4.1, where it doesn't. 
I also tried calling getPreferredLocs on the spark context with the RDD, and 
that correctly gave me back the 3 locations of the partition I passed it, in 
both 1.3.1 and 1.4.1. So as far as I can tell the logic for getPreferredLocs 
and getPreferredLocations matches across versions, and it appears that the 
scheduler's use of this information is what must have changed. However, I 
could not find many references to either of these 2 functions, so I was not 
able to debug much further.
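
For anyone who wants to repeat the check, this is roughly how the two methods 
can be probed (a sketch: it assumes rdd is the NewHadoopRDD itself rather than 
a transformed child, since getDeclaredMethod only sees the override on the 
concrete class):

{code:scala}
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

def probeLocality(rdd: RDD[_]): Unit = {
  // RDD.getPreferredLocations is protected in source, so use reflection.
  val m = rdd.getClass.getDeclaredMethod("getPreferredLocations",
    classOf[Partition])
  m.setAccessible(true)
  println("rdd.getPreferredLocations -> " + m.invoke(rdd, rdd.partitions(0)))

  // SparkContext.getPreferredLocs is private[spark] in source but public in
  // bytecode, so plain reflection can reach it too.
  val m2 = sc.getClass.getMethod("getPreferredLocs",
    classOf[RDD[_]], classOf[Int])
  println("sc.getPreferredLocs -> " + m2.invoke(sc, rdd, Integer.valueOf(0)))
}
{code}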


