Hi Yu,

We upgraded Breeze to 0.10 yesterday, so we can now easily call the distance functions you contributed to Breeze. To keep the maintenance cost low, we don't want to maintain another copy of the implementation in MLlib. Both Spark and Breeze are open-source projects, and we should try our best to avoid duplicate effort and forking, even though we don't control the release of Breeze.
As we discussed in the PR, if we want users to call the distance functions directly, they should live in Breeze. If we want users to specify them in clustering algorithms, we should hide the implementation from users, so simple wrappers over the Breeze implementation should be sufficient.

We are reviewing https://github.com/apache/spark/pull/2634 and trying to see how we can embed distance measures there. In the k-means implementation, we don't use (Vector, Vector) => Double. Instead, we cache the norms and use the inner product to derive the distance, which is faster and takes advantage of sparsity. It would be really nice if you could help review it and discuss how to embed distance measures there. Thanks!

Best,
Xiangrui

On Wed, Oct 8, 2014 at 4:19 AM, Yu Ishikawa <yuu.ishikawa+sp...@gmail.com> wrote:
> Hi all,
>
> In my limited understanding of MLlib, it would be a good idea to support
> various distance functions in some machine learning algorithms. For example,
> we can only use the Euclidean distance metric in KMeans. I am also working on
> contributing hierarchical clustering to MLlib
> (https://issues.apache.org/jira/browse/SPARK-2429), and I would like to support
> various distance functions in it.
>
> Should we support standardized distance functions in MLlib or not?
> As you know, Spark depends on Breeze, so I think we have two approaches to
> using distance functions in MLlib. One is implementing some distance
> functions in MLlib. The other is wrapping the functions of Breeze. I am
> a bit worried about using Breeze directly in Spark. For example, we can't
> fully control the release of Breeze.
>
> I sent a PR before, but it has stalled. I'd like to get the community's
> thoughts on it.
> https://github.com/apache/spark/pull/1964#issuecomment-54953348
>
> Best,
>
> -----
> -- Yu Ishikawa
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Standardized-Distance-Functions-in-MLlib-tp8697.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
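
For reference, the norm-caching idea mentioned in the reply (cache the norms, then derive the distance from the inner product) rests on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b). The following is only a minimal, self-contained sketch of that identity. It is not the code in the PR under review, and every name in it (SparseVec, VectorWithNorm, NormCachedDistance) is hypothetical and chosen for illustration. It assumes sparse vectors stored as sorted index/value arrays.

```scala
// Hypothetical sketch: none of these types come from MLlib or Breeze.

// A sparse vector with strictly increasing indices and matching values.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length, "indices and values must align")

  // Squared L2 norm; only the stored (non-zero) entries are visited.
  def squaredNorm: Double = values.iterator.map(v => v * v).sum
}

// A vector paired with its squared norm, computed once and then reused.
final case class VectorWithNorm(vector: SparseVec, squaredNorm: Double)

object NormCachedDistance {
  def withNorm(v: SparseVec): VectorWithNorm = VectorWithNorm(v, v.squaredNorm)

  // Inner product by merging the two sorted index arrays.
  // The cost depends on the number of stored entries, not on the dimension.
  def dot(a: SparseVec, b: SparseVec): Double = {
    require(a.size == b.size, "vectors must have the same dimension")
    var i = 0
    var j = 0
    var sum = 0.0
    while (i < a.indices.length && j < b.indices.length) {
      val ia = a.indices(i)
      val ib = b.indices(j)
      if (ia == ib) {
        sum += a.values(i) * b.values(j)
        i += 1
        j += 1
      } else if (ia < ib) {
        i += 1
      } else {
        j += 1
      }
    }
    sum
  }

  // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b), using the cached norms.
  def squaredDistance(a: VectorWithNorm, b: VectorWithNorm): Double = {
    val d = a.squaredNorm + b.squaredNorm - 2.0 * dot(a.vector, b.vector)
    math.max(d, 0.0) // clamp small negative values caused by rounding
  }
}
```

Because each point's norm is computed once, comparing a point against many cluster centers costs one sparse inner product per comparison. One caveat of this formulation is that the expansion can lose precision when the two vectors are nearly identical, so a production implementation would likely need a fallback for that case.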