Thanks for a great summary of issues!
I agree there's lots to do, though I think most of the issues that you
list are quite hard and require thinking hard about the API.
So they might not be super amenable to being solved by a shorter-term
project.
I was hoping there would be some more easy wins that we could get by
exploiting OpenMP better (or at all) in the distances.
Not sure whether there are, though.
I wonder if having a multicore implementation of euclidean_distances
would be useful for us, or if that's going too low-level.
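For what it's worth, a minimal sketch of what a chunked multicore
euclidean_distances might look like with joblib (the function name
parallel_euclidean and the chunk size are illustrative choices, not
existing sklearn API):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances

def parallel_euclidean(X, Y, n_jobs=2, chunk=128):
    # Split X into row blocks and compute each block's distances in a worker,
    # then stack the blocks back into the full distance matrix.
    slices = [slice(i, i + chunk) for i in range(0, X.shape[0], chunk)]
    blocks = Parallel(n_jobs=n_jobs)(
        delayed(euclidean_distances)(X[s], Y) for s in slices)
    return np.vstack(blocks)

rng = np.random.RandomState(0)
X, Y = rng.rand(300, 5), rng.rand(40, 5)
D = parallel_euclidean(X, Y)
```

Whether process-based joblib beats OpenMP at the Cython level here is
exactly the sort of question such a project would need to answer.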
On 3/3/20 5:47 PM, Joel Nothman wrote:
I noticed a comment by @amueller on Gitter re considering a project on
our distances implementations.
I think there's a lot of work that can be done in unifying distances
implementations... (though I'm not always sure of the benefit). I thought
I would summarise some of the issues below, as I was unsure what Andy
intended.
As @jeremiedbb said, making n_jobs more effective would be beneficial.
Reducing duplication between metrics.pairwise and
neighbors._dist_metrics and kmeans would be noble (especially with
regard to parameters, where scipy.spatial's mahalanobis available
through sklearn.metrics does not accept V but sklearn.neighbors does),
and could also offer greater consistency of results and efficiency.
We also have idioms in the code like "if the metric is euclidean, use
squared=True where we only need a ranking, then take the square root",
while the neighbors metrics abstract this with an API by providing rdist
and rdist_to_dist.
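To illustrate that abstraction (the module path varies by version; in
newer releases DistanceMetric is exposed from sklearn.metrics rather
than sklearn.neighbors):

```python
import numpy as np
try:
    from sklearn.metrics import DistanceMetric    # newer versions
except ImportError:
    from sklearn.neighbors import DistanceMetric  # older versions

metric = DistanceMetric.get_metric('euclidean')
X = np.array([[0.0, 0.0], [3.0, 4.0]])

dist = metric.pairwise(X)           # true distances
rdist = metric.dist_to_rdist(dist)  # "reduced" distance: squared euclidean
# ranking by rdist avoids the square root; convert back only at the end
back = metric.rdist_to_dist(rdist)
```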
There are issues about making sure that
pairwise_distances(metric='minkowski', p=2) is using the same
implementation as pairwise_distances(metric='euclidean'), etc.
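For instance, one would hope the following two calls agree, at least up
to floating point (sklearn's euclidean path uses the dot-product trick
internally, so the results need not be bit-identical):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(50, 4)
D_mink = pairwise_distances(X, metric='minkowski', p=2)  # scipy-backed path
D_eucl = pairwise_distances(X, metric='euclidean')       # sklearn's own path
print(np.abs(D_mink - D_eucl).max())  # small, but not necessarily zero
```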
We have issues with chunking and distributing computations in the case
that metric params are derived from the dataset (ideally a training set).
#16419 is a simple instance where the metric param is sample-aligned
and needs to be chunked up.
In other cases, we precompute some metric param over all the data,
then pass it to each chunk worker, using _precompute_metric_params
introduced in #12672. This is also relevant to #9555.
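The chunking machinery itself is already exposed as
pairwise_distances_chunked, which yields row blocks sized to fit a
working-memory budget:

```python
import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
# working_memory (in MiB) bounds the size of each yielded block of rows
chunks = pairwise_distances_chunked(X, working_memory=1)
D = np.vstack(list(chunks))
```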
While that initial implementation in #12672 is helpful and aims to
maintain backwards compatibility, it makes some dubious choices.
Firstly in terms of code structure it is not a very modular approach -
each metric is handled with an if-then. Secondly, it *only* handles
the chunking case, relying on the fact that these metrics are in
scipy.spatial, and have a comparable handling of V=None and VI=None.
In the Gower Distances PR (#9555) when implementing a metric locally,
rather than relying on scipy.spatial, we needed to provide an
implementation of these default parameters both when the data is
chunked and when the metric function is called straight out.
Thirdly, its approach to training vs test data is dubious. We don't
formally label X and Y in pairwise_distances as train/test, and
perhaps we should. Maintaining backwards compat with scipy's
seuclidean and mahalanobis, our implementation stacks X and Y together
if both are provided, and then calculates their variance. This
means that users may be applying a different metric at train and at
test time (if the variance of X as train and Y as test is
substantially different), which I consider a silent error. We can
either make the train/test nature of X and Y more explicit, or we can
require that data-based parameters are given explicitly by the user
and not implicitly computed. If I understand correctly,
sklearn.neighbors will not compute V or VI for you, and it must be
provided explicitly. (Requiring that the scaling of each feature be
given explicitly in Gower seems like an unnecessary burden on the
user, however.)
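The silent discrepancy is easy to demonstrate with scipy directly (the
data here is synthetic, with Y deliberately put on a different scale):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.rand(20, 3)        # "training" data
Y = rng.rand(5, 3) * 10    # "test" data on a very different scale

V_train = X.var(axis=0, ddof=1)                  # variance from X only
V_stack = np.vstack([X, Y]).var(axis=0, ddof=1)  # variance from stacked X, Y
D_train = cdist(Y, X, metric='seuclidean', V=V_train)
D_stack = cdist(Y, X, metric='seuclidean', V=V_stack)
# the two conventions yield genuinely different metrics
print(np.abs(D_train - D_stack).max())
```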
Then there are issues like whether we should consistently set the
diagonal to zero in all metrics where Y=None.
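For euclidean, the diagonal is already zeroed explicitly when Y=None; a
quick check, assuming current behaviour:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(30, 4)
D = pairwise_distances(X)  # Y=None: X compared with itself
print(np.diag(D).max())
# other metrics may leave tiny floating-point residue on the diagonal
```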
In short, there are several projects in distances, and I'd support
them being considered for work... But it's a lot of engineering, even
if motivated by ML needs and consistency for users.
J
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn