[ 
https://issues.apache.org/jira/browse/SPARK-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robbie Russo updated SPARK-10006:
---------------------------------
    Priority: Critical  (was: Major)

> Locality broken in spark 1.4.x for NewHadoopRDD
> -----------------------------------------------
>
>                 Key: SPARK-10006
>                 URL: https://issues.apache.org/jira/browse/SPARK-10006
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0, 1.4.1
>            Reporter: Robbie Russo
>            Priority: Critical
>              Labels: performance
>
> After upgrading to Spark 1.4.x, locality appears to be entirely broken for 
> NewHadoopRDD on a Spark cluster that is co-located with an HDFS cluster. 
> Whereas an identical job on Spark 1.2.x or 1.3.x runs all partitions at 
> locality level NODE_LOCAL for us, after upgrading to 1.4.x the locality 
> level switched to ANY for every partition. 
> Furthermore, the scheduler appears to launch the tasks roughly in order of 
> their locations, or something to that effect, because one node at a time 
> becomes a hotspot with completely maxed-out resources during the read. To 
> test this theory I wrote a job that lists all the files on the driver, 
> parallelizes the list, and then loads the files back through the Hadoop API 
> in a mapPartitions function (which, correct me if I'm wrong, should be 
> equivalent to running with ANY locality). The result was that my hack ran 
> 4x faster than letting Spark read the files itself!
> As for the performance impact, this caused a 12x slowdown for us from 1.3.1 
> to 1.4.1. Needless to say, we have downgraded for now and everything works 
> normally again.
> We were able to reproduce this behavior on multiple clusters, and on both 
> Hadoop 2.4 and Hadoop 2.6 (I saw that there are two different code paths 
> for computing preferred locations, depending on whether Hadoop 2.6 is 
> present). The only thing that has fixed the problem for us is downgrading 
> to 1.3.1.
> Not sure how helpful it will be, but through reflection I checked the 
> result of calling the getPreferredLocations method on the RDD, and it 
> returned an empty List both on 1.3.1, where locality works, and on 1.4.1, 
> where it doesn't. I also tried calling getPreferredLocs on the 
> SparkContext with the RDD, and that correctly gave me back the three 
> locations of the partition I passed it, on both 1.3.1 and 1.4.1. So as far 
> as I can tell, the logic for getPreferredLocs and getPreferredLocations 
> matches across versions, and it must be the scheduler's use of this 
> information that changed. However, I could not find many references to 
> either of these two functions, so I was not able to debug much further.
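The workaround described in the report (list files on the driver, parallelize the list, read through the Hadoop FileSystem API inside mapPartitions) could be sketched roughly as follows. This is a hypothetical reconstruction, not the reporter's actual code; `readWithoutLocality` and the directory argument are illustrative names, and stream cleanup is simplified for brevity.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Enumerate the input files on the driver, parallelize the file list, and
// open each file through the Hadoop FileSystem API inside mapPartitions.
// Because the RDD is built from parallelize, the scheduler has no location
// preferences for any partition -- effectively ANY locality for every task.
def readWithoutLocality(sc: SparkContext, dir: String): RDD[String] = {
  val fs = FileSystem.get(new Configuration())
  val files = fs.listStatus(new Path(dir)).map(_.getPath.toString)

  sc.parallelize(files.toSeq, files.length).mapPartitions { paths =>
    // Recreate the (non-serializable) Configuration on the executor side.
    val execFs = FileSystem.get(new Configuration())
    paths.flatMap { p =>
      val in = execFs.open(new Path(p))
      scala.io.Source.fromInputStream(in).getLines()
    }
  }
}
```

The point of the comparison in the report is that this locality-blind path was still 4x faster than NewHadoopRDD on 1.4.1, which implicates how the scheduler consumes locality information rather than raw read throughput.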



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
