[
https://issues.apache.org/jira/browse/SPARK-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robbie Russo updated SPARK-10006:
---------------------------------
Priority: Critical (was: Major)
> Locality broken in spark 1.4.x for NewHadoopRDD
> -----------------------------------------------
>
> Key: SPARK-10006
> URL: https://issues.apache.org/jira/browse/SPARK-10006
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.0, 1.4.1
> Reporter: Robbie Russo
> Priority: Critical
> Labels: performance
>
> After upgrading to Spark 1.4.x, locality seems to be entirely broken for
> NewHadoopRDD with a Spark cluster that is co-located with an HDFS cluster.
> Whereas an identical job run in Spark 1.2.x or 1.3.x would run all
> partitions for us with locality level NODE_LOCAL, after upgrading to 1.4.x
> the locality level switched to ANY for all partitions.
> Furthermore, it appears to be launching the tasks in order of their
> locations, or something to that effect, because there are hotspots of one
> node at a time with completely maxed resources during the read. To test this
> theory I wrote a job that scans for all the files in the driver,
> parallelizes the list, and then loads the files back through the Hadoop API
> in a mapPartitions function (which, correct me if I'm wrong, should be
> identical to using ANY locality), and the result was that my hack was 4x
> faster than letting Spark parse the files itself!
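To make the workaround concrete, here is a minimal sketch of the kind of job described above (the input path, app name, and object name are placeholders, not from the original report): list the files in the driver, parallelize the list, and open each file through the Hadoop FileSystem API inside mapPartitions, so every task is effectively scheduled at ANY locality.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object ManualReadJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("manual-read"))
    val inputDir = new Path("hdfs:///data/input") // hypothetical path

    // Enumerate the input files on the driver.
    val fs = inputDir.getFileSystem(new Configuration())
    val files = fs.listStatus(inputDir).map(_.getPath.toString).toSeq

    // Parallelize the file list and open each file on the executors,
    // bypassing NewHadoopRDD's split/locality machinery entirely.
    val lines = sc.parallelize(files, files.size).mapPartitions { paths =>
      val conf = new Configuration()
      paths.flatMap { p =>
        val path = new Path(p)
        val in = path.getFileSystem(conf).open(path)
        Source.fromInputStream(in).getLines()
      }
    }

    println(lines.count())
    sc.stop()
  }
}
```

Since the file list is parallelized with no location information, the scheduler has nothing to wait for and spreads tasks evenly, which is why this baseline makes a useful comparison against the (supposedly locality-aware) NewHadoopRDD path.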
> As for the performance effect, this caused a 12x slowdown for us from 1.3.1
> to 1.4.1. Needless to say we have downgraded for now and everything appears
> to work normally again.
> We were able to reproduce this behavior on multiple clusters, and on both
> Hadoop 2.4 and Hadoop 2.6 (I saw that there are two different code paths for
> figuring out preferred locations, depending on the presence of Hadoop 2.6).
> The only thing that has fixed the problem for us is downgrading back to
> 1.3.1.
> Not sure how helpful it will be, but through reflection I checked the result
> of calling the getPreferredLocations method on the RDD, and it returned an
> empty List both on 1.3.1, where locality works, and on 1.4.1, where it
> doesn't. I also tried calling getPreferredLocs on the SparkContext with the
> RDD, and that correctly gave me back the 3 locations of the partition I
> passed it, in both 1.3.1 and 1.4.1. So as far as I can tell the logic for
> getPreferredLocs and getPreferredLocations matches across versions, and it
> appears that the use of this information in the scheduler is what must have
> changed. However, I could not find many references to either of these two
> functions, so I was not able to debug much further.
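For anyone wanting to repeat the probe described above, here is a rough sketch of how such a reflection check might look. This is my reconstruction, not the reporter's actual code: both methods are non-public in Spark's API (RDD.getPreferredLocations is protected; SparkContext.getPreferredLocs is private[spark]), which is presumably why reflection was needed. Scala compiles these access levels to public JVM methods, so a simple name lookup tends to work, but that is an implementation detail and may vary by version.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object LocalityProbe {
  // Invoke rdd.getPreferredLocations(partition) reflectively.
  def viaRdd(rdd: RDD[_], partitionIndex: Int): Any = {
    val part = rdd.partitions(partitionIndex)
    val m = rdd.getClass.getMethods
      .find(_.getName == "getPreferredLocations")
      .getOrElse(sys.error("getPreferredLocations not found"))
    m.setAccessible(true)
    m.invoke(rdd, part)
  }

  // Invoke sc.getPreferredLocs(rdd, partitionIndex) reflectively.
  def viaSparkContext(sc: SparkContext, rdd: RDD[_], partitionIndex: Int): Any = {
    val m = sc.getClass.getMethods
      .find(_.getName == "getPreferredLocs")
      .getOrElse(sys.error("getPreferredLocs not found"))
    m.setAccessible(true)
    m.invoke(sc, rdd, Int.box(partitionIndex))
  }
}
```

Comparing the two results on the same RDD and partition, as the report does, isolates whether the location data itself is missing or whether the scheduler is simply not consuming it.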
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]