Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54953348
@mengxr, @erikerlandson, @rnowling Sorry for having been kept waiting. I
redesigned the interface of distance functions. Here is the full file change.
https://github.com/apache/spark/pull/1964/files
I agree to keep MLlib's linear algebra lightweight. And we would had better
to use Breeze in terms of the performance. However, I think the standardized
distance function API helps us to expand MLlib efficiently. And also, by
contributing another distance functions to Breeze, we can use them in Spark.
I am a bit worried about using Breeze directly in Spark. One reason is that
we can't absolutely control the release of Breeze. If we want to use a new
distance function in Spark, we contribute it to Breeze as I did. Because there
is a time difference between being merged and release. We have to wait. That's
why we should have a wrapper API for Breeze linear algebra. And the wrapper API
helps us to extend another distance function easily.
I think we need two the standardized distance function API. One is for
Spark user, the other is for MLlib developer. `(Vector, Vector) => Double` is
for Spark user. `(BV[Double], BV[Double]) => Double ` is for MLlib developer.
And Breeze should be hidden for Spark user. At least, we should have a mapping
function from `String` to a distance function because of PySpark API.
We have to use the implementation in a wrapper API as much as we can. That
is, you know, I contributed some distance functions to Breeze. However,they
have not been released on Maven repository. That's why we can't use them in
Spark yet. If they will be released, I will modify the implementation, such as
the `apply` method of `EuclideanDistanceMetric` class.
## Diff Summary
- `(BV[Double], BV[Double]) => Double` is for Spark user, such as
`EuclideanDistanceMetric` class
- Companion Object of the distance Function is for MLlib developer, such as
`EuclideanDistanceMetric` object
- Extract WeightedDistanceMetric and its sub class from DistanceMetric
- Extract WeightedDistanceMeasure and its sub class from DistanceMeasure
- Add a mapping function in
`mllib/src/main/scala/org/apache/spark/mllib/linalg/distance/DistanceMeasureFactory.scala`
thanks
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]