[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120432#comment-14120432 ]
Evan Sparks commented on SPARK-3384:
------------------------------------

I agree with Sean. Avoiding the cost of object allocation here is important. As far as I can tell, we are using reduceByKey in the prescribed way (see: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E), mutating the left input. I don't believe that Spark needs this mutation to be thread-safe, because it executes the combine sequentially on each worker and then reduces sequentially on the master, but I could be wrong.

> Potential thread unsafe Breeze vector addition in KMeans
> --------------------------------------------------------
>
>            Key: SPARK-3384
>            URL: https://issues.apache.org/jira/browse/SPARK-3384
>        Project: Spark
>     Issue Type: Bug
>     Components: MLlib
>       Reporter: RJ Nowling
>
> In the KMeans clustering implementation, the Breeze vectors are accumulated
> using +=. For example,
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
> This is a potentially thread-unsafe operation. (This is what I observed in
> local testing.) I suggest changing the += to + -- a new object will be
> allocated, but it will be thread-safe since it won't write to an old location
> accessed by multiple threads.
> Further testing is required to reproduce and verify.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
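The trade-off under discussion can be sketched outside Spark: a reduce-style merge function that mutates its left argument avoids per-merge allocation, and is safe as long as the framework applies merges for a given key sequentially (as reduceByKey does within a partition and again on the driver). A minimal Python sketch, with hypothetical names (merge_in_place, merge_allocating, combine_by_key) that are illustrations, not Spark or Breeze APIs:

```python
# Hypothetical sketch of the two accumulation styles discussed above.
# None of these names are Spark APIs; this only models the sequential
# per-key combine that reduceByKey performs within one partition.

def merge_in_place(left, right):
    """Mutate the left vector, analogous to Breeze's += in KMeans."""
    for i in range(len(left)):
        left[i] += right[i]
    return left  # no new object allocated

def merge_allocating(left, right):
    """Allocate a fresh vector, analogous to Breeze's + ."""
    return [a + b for a, b in zip(left, right)]

def combine_by_key(pairs, merge):
    """Sequential per-key combine: merges for one key never run
    concurrently here, so mutating the left argument is safe."""
    acc = {}
    for key, vec in pairs:
        if key in acc:
            acc[key] = merge(acc[key], vec)
        else:
            acc[key] = list(vec)  # copy so the caller's input is never mutated
    return acc

pairs = [(0, [1.0, 2.0]), (1, [0.5, 0.5]), (0, [3.0, 4.0])]
print(combine_by_key(pairs, merge_in_place))   # {0: [4.0, 6.0], 1: [0.5, 0.5]}
print(combine_by_key(pairs, merge_allocating)) # same result, extra allocations
```

Both styles give the same answer under sequential combining; the in-place style only becomes unsafe if two threads can call merge on the same accumulator concurrently, which is the scenario the bug report suspects.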