On 21/02/14 14:07, Javier Martínez-López wrote:
> Hello,
>
> I am quite new to Python, so I am sorry if I am asking something too
> simple: I am trying to speed up a raster processing job with Python and
> I want to use the "sklearn.metrics.pairwise.pairwise_distances" function
> with the Mahalanobis distance metric and "n_jobs=-1" in order to speed
> up the process using all processors. However, when I run it, it only
> uses one processor... while if I use the Euclidean distance all
> processors are used... Where is the problem? Am I doing something
> wrong?
>
> Thank you and cheers,
>
> Javier

It does not seem to be parallelized with joblib; it is just deferred to
scipy.spatial.distance.mahalanobis. Assuming X is an ndata-by-nparam
array, m is the mean vector and S is the covariance matrix, you can
compute Mahalanobis distances in parallel like so:

import numpy as np
from scipy.linalg import cholesky, solve_triangular
from sklearn.externals.joblib import Parallel, delayed
from multiprocessing import cpu_count

def _schedule(n, nproc):
    """Guided scheduler: yield successively smaller slices of range(n)."""
    start = 0
    size = (n - start) // nproc
    while size > 100:
        yield slice(start, start + size)
        start += size
        size = (n - start) // nproc
    yield slice(start, n)

def _mahalanobis_distances(m, L, X):
    # Solve L z = (x - m) for each row; the Mahalanobis distance is ||z||.
    cX = X - m[np.newaxis, :]
    tmp = solve_triangular(L, cX.T, lower=True).T
    tmp **= 2
    return np.sqrt(tmp.sum(axis=1))

def mahalanobis_distances(m, S, X, parallel=True):
    L = cholesky(S, lower=True)
    n = X.shape[0]
    if parallel:
        nproc = cpu_count()
        res = (Parallel(n_jobs=-1)
               (delayed(_mahalanobis_distances)(m, L, X[s, :])
                for s in _schedule(n, nproc)))
        return np.hstack(res)
    else:
        return _mahalanobis_distances(m, L, X)

# scipy.spatial.distance.mahalanobis for comparison
from scipy.spatial import distance

def _mahalanobis_distances_scipy(m, SI, X):
    n = X.shape[0]
    mahal = np.zeros(n)
    for i in xrange(n):
        x = X[i, :]
        mahal[i] = distance.mahalanobis(x, m, SI)
    return mahal

def mahalanobis_distances_scipy(m, S, X, parallel=True):
    SI = np.linalg.inv(S)
    n = X.shape[0]
    if parallel:
        nproc = cpu_count()
        res = (Parallel(n_jobs=-1)
               (delayed(_mahalanobis_distances_scipy)(m, SI, X[s, :])
                for s in _schedule(n, nproc)))
        return np.hstack(res)
    else:
        return _mahalanobis_distances_scipy(m, SI, X)

if __name__ == "__main__":

    # verify accuracy
    X = np.random.randn(300000 * 16).reshape((300000, 16))
    m = np.mean(X, axis=0)
    S = np.cov(X, rowvar=False)
    mahal_scipy = mahalanobis_distances_scipy(m, S, X)
    mahal_cholesky = mahalanobis_distances(m, S, X)
    print("Similar result to scipy.spatial.distance.mahalanobis: %s"
          % ('true' if np.allclose(mahal_scipy - mahal_cholesky, 0) else 'false',))

    mahal = (mahalanobis_distances(m, S, X, parallel=True)
             - mahalanobis_distances(m, S, X, parallel=False)
             + mahalanobis_distances_scipy(m, S, X, parallel=True)
             - mahalanobis_distances_scipy(m, S, X, parallel=False))
    print("Similar results with and without parallel execution: %s"
          % ('true' if np.allclose(mahal, 0) else 'false',))

    # timing
    from time import clock
    print("Example timings:")

    t0 = clock()
    mahalanobis_distances(m, S, X, parallel=True)
    t1 = clock()
    print("cholesky, parallel: %f ms" % (1000 * (t1 - t0),))

    t0 = clock()
    mahalanobis_distances(m, S, X, parallel=False)
    t1 = clock()
    print("cholesky, single process: %f ms" % (1000 * (t1 - t0),))

    t0 = clock()
    mahalanobis_distances_scipy(m, S, X, parallel=True)
    t1 = clock()
    print("scipy, parallel: %f ms" % (1000 * (t1 - t0),))

    t0 = clock()
    mahalanobis_distances_scipy(m, S, X, parallel=False)
    t1 = clock()
    print("scipy, single process: %f ms" % (1000 * (t1 - t0),))
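A quick word on why the triangular solve above works: with the Cholesky
factorization S = L L^T, solving L z = x - m gives
z^T z = (x - m)^T S^-1 (x - m), so the Mahalanobis distance is just the
Euclidean norm of z. A minimal sketch checking this on a single vector
(the names below are illustrative, not part of the script above):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.RandomState(0)
X = rng.randn(1000, 4)
m = X.mean(axis=0)
S = np.cov(X, rowvar=False)

x = X[0]
L = cholesky(S, lower=True)
z = solve_triangular(L, x - m, lower=True)  # solves L z = x - m
d_chol = np.sqrt(np.dot(z, z))              # ||z||
d_direct = np.sqrt(np.dot(x - m, np.linalg.solve(S, x - m)))
print(np.allclose(d_chol, d_direct))        # True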
Running the script gives me:

$ python mahal.py
Similar result to scipy.spatial.distance.mahalanobis: true
Similar results with and without parallel execution: true
Example timings:
cholesky, parallel: 122.138000 ms
cholesky, single process: 159.878000 ms
scipy, parallel: 308.829000 ms
scipy, single process: 6744.384000 ms

Actually, there is a strange interaction with MKL threading here:

try:
    import mkl
    mkl.set_num_threads(1)
except ImportError:
    pass

Now I get these timings:

$ python mahal.py
Similar result to scipy.spatial.distance.mahalanobis: true
Similar results with and without parallel execution: true
Example timings:
cholesky, parallel: 120.784000 ms
cholesky, single process: 84.593000 ms
scipy, parallel: 129.127000 ms
scipy, single process: 6749.786000 ms

The version that uses a single MKL thread and a single process is
actually the fastest. I could probably get even better results using
Cython, BLAS and LAPACK, but this will suffice for now.

Sturla
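P.S. The mkl module used above only ships with certain Python
distributions (e.g. Anaconda or Intel's). If it is not available, the
same effect can usually be had by capping the BLAS thread count through
environment variables before NumPy is first imported. A sketch, assuming
an MKL- or OpenBLAS-backed NumPy:

import os
# must be set before numpy/scipy are imported
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"       # generic OpenMP fallback
import numpy as np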