Sklearn folks,
imho the Right Thing for dense / sparse distances would be to combine
1) Cython for L1 / L2 / Linf on (dense, dense):
   see _distance_p in
   http://svn.scipy.org/svn/scipy/trunk/scipy/spatial/ckdtree.pyx
   (~ 20 lines; a pure-numpy sketch of what it computes is below)
2) pure Python to expand sparse rows with todense().
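
For reference, here is roughly what that kernel computes
(minkowski_p is my name, not scipy's; the Cython version just
runs the same sum in a tight C loop over the raw doubles):

import numpy as np

def minkowski_p( x, y, p=2 ):
    """ sum |x - y|**p over dense 1d rows; p=inf -> max |x - y| """
    diff = np.abs( np.asarray(x) - np.asarray(y) )
    if np.isinf( p ):
        return diff.max()        # Linf
    return (diff ** p).sum()     # L1 for p=1, squared L2 for p=2
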
I believe -- correct me -- that dist( sparse, dense )
is much more common than dist( sparse, sparse ) anyway ?
(except Hamming distance on long sparse bool vectors).
Below is a first cut at a cdist_sparse which just calls cdist(),
trivial enough to have a chance of being correct :)
As a side point, do ML people generally use L1 rather than
outlier-sensitive L2 ? Or custom metrics,
in which case we need pluggable metrics, as cdist has, anyway ?
import numpy as np
from scipy.sparse import issparse
from scipy.spatial.distance import cdist

def cdist_sparse( X, Y, **kwargs ):
    """ -> cdist( X, Y ) where X and/or Y may be sparse; any metric """
    # densify one row at a time; very slow if both are very sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:  # both dense
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:  # X sparse, Y dense
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs ) [0]
    elif sxy == 1:  # X dense, Y sparse
        for k, y in enumerate(Y):
            d[:,k] = cdist( X, y.todense(), **kwargs ) [:,0]
    else:  # both sparse: densify a pair of rows at a time
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs ) [0,0]
    return d
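
To exercise it, something like this (kwargs pass straight through
to cdist, so metric="cityblock" etc. work unchanged; tiny made-up
test data, untested on anything real):

from scipy.sparse import csr_matrix

X = csr_matrix( np.eye(3) )                          # 3 x 3 sparse
Y = np.arange( 12, dtype=np.float64 ).reshape(4, 3)  # 4 x 3 dense
D = cdist_sparse( X, Y, metric="cityblock" )         # L1, shape 3 x 4
print( D.shape )  # (3, 4)
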
cheers
-- denis