Robbie Russo created SPARK-10006:
------------------------------------
Summary: Locality broken in spark 1.4.x for NewHadoopRDD
Key: SPARK-10006
URL: https://issues.apache.org/jira/browse/SPARK-10006
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.4.1, 1.4.0
Reporter: Robbie Russo
After upgrading to Spark 1.4.x, locality appears to be entirely broken for
NewHadoopRDD on a Spark cluster that is co-located with an HDFS cluster.
An identical job that ran all partitions at locality level NODE_LOCAL on Spark
1.2.x and 1.3.x now runs every partition at locality level ANY on 1.4.x.
Furthermore, it appears to be launching the tasks in order of their
locations, or something to that effect, because the read produces hotspots: one
node at a time has its resources completely maxed out. To test this theory I
wrote a job that scans for all the files in the driver, parallelizes the list,
and then loads the files back through the Hadoop API in a mapPartitions
function (correct me if I'm wrong, but this should be identical to using ANY
locality?). The result was that my hack was 4x faster than letting Spark
parse the files itself!
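For illustration, a minimal sketch of that kind of workaround job might look like
the following. It is not the reporter's actual code: the object name, the
input-directory argument, and the assumption that the inputs are UTF-8 text files
in a single flat directory are all hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

// Hypothetical sketch: list files on the driver, then read them inside
// mapPartitions so that Spark has no locality preference for the tasks.
object DriverListedRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-listed-read"))
    val inputDir = new Path(args(0)) // assumed: a flat directory of text files

    // Scan for the files on the driver.
    val fs = inputDir.getFileSystem(sc.hadoopConfiguration)
    val files = fs.listStatus(inputDir).filter(_.isFile).map(_.getPath.toString)

    // One path per partition; a parallelized collection carries no preferred
    // locations, so tasks can be scheduled on any executor.
    val lines = sc.parallelize(files, math.max(1, files.length)).mapPartitions { paths =>
      // Assumes the executors' classpath provides the cluster's Hadoop config.
      val conf = new Configuration()
      paths.flatMap { p =>
        val path = new Path(p)
        val in = path.getFileSystem(conf).open(path)
        // Read eagerly so the stream can be closed; this holds a whole file
        // in memory, which is acceptable only for a sketch.
        try Source.fromInputStream(in, "UTF-8").getLines().toList
        finally in.close()
      }
    }

    println(s"lines read: ${lines.count()}")
    sc.stop()
  }
}
```

Because the file list is an ordinary parallelized collection, the scheduler
never sees HDFS block locations for these tasks, which is why the comparison
with ANY locality seems reasonable.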
As for the performance effect, this has caused a 12x slowdown for us from 1.3.1
to 1.4.1. Needless to say, we have downgraded for now and everything appears
to work normally again.
We were able to reproduce this behavior on multiple clusters and on both
Hadoop 2.4 and Hadoop 2.6 (I saw that there are 2 different code paths for
figuring out preferred locations, depending on whether Hadoop 2.6 is present).
The only thing that has fixed the problem for us is downgrading to 1.3.1.
Not sure how helpful it will be, but through reflection I called the
getPreferredLocations method on the RDD, and it returned an empty List on both
1.3.1, where locality works, and 1.4.1, where it doesn't. I also tried calling
getPreferredLocs on the Spark context with the RDD, and that correctly gave me
back the 3 locations of the partition I passed in on both 1.3.1 and 1.4.1. So
as far as I can tell, getPreferredLocs and getPreferredLocations behave the
same across versions, and what must have changed is how the scheduler uses
this information. However, I could not find many references to either of these
2 functions, so I was not able to debug much further.
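For anyone who wants to inspect the locations without reflection, a rough
sketch using the public RDD.preferredLocations method (which delegates to
getPreferredLocations unless the RDD is checkpointed) could look like this.
The object name, the input path argument, and the choice of TextInputFormat
are assumptions for illustration, not details from the report; it does not
exercise SparkContext.getPreferredLocs, which is not part of the public API.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: print the preferred locations Spark reports for each
// partition of a NewHadoopRDD, to compare the output across Spark versions.
object PrintPreferredLocations {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("print-preferred-locations"))

    // newAPIHadoopFile builds a NewHadoopRDD; args(0) is an assumed HDFS path.
    val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](args(0))

    rdd.partitions.foreach { p =>
      // preferredLocations is the public wrapper around getPreferredLocations.
      val locs = rdd.preferredLocations(p)
      println(s"partition ${p.index}: ${locs.mkString(", ")}")
    }

    sc.stop()
  }
}
```

Running the same job against 1.3.1 and 1.4.1 and diffing the output would
show whether the RDD-level location information differs between versions.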