On 03/07/14 04:01, Sturla Molden wrote:
> On 02/07/14 06:02, Valerio Maggio wrote:
>
>> You were right when you said that under the hood the main `for` loop 
>> iterates over the number of components, but in scikit this is
>> not done *explicitly* via Python loops.
>
> Look at lines 596 and 692.
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/gmm.py
>


By the way:

This code is highly vectorized with np.dot, sp.linalg.cholesky and 
sp.linalg.solve_triangular. Those call into BLAS and LAPACK, so the 
heavy linear algebra can already run in parallel when those libraries 
are multithreaded.
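
To make that concrete, here is a minimal sketch of the kind of computation 
involved: the log-density of one full-covariance Gaussian component, using a 
Cholesky factor and a triangular solve. It is not the code in gmm.py; the 
function name log_gaussian_full and its arguments are made up for 
illustration.

```python
import numpy as np
from scipy import linalg


def log_gaussian_full(X, mean, cov):
    """Log-density of each row of X under N(mean, cov) (hypothetical helper)."""
    n_samples, n_dim = X.shape
    # Cholesky factorization (LAPACK): cov = L @ L.T, L lower triangular
    L = linalg.cholesky(cov, lower=True)
    # Triangular solve (LAPACK): L @ Z = (X - mean).T
    Z = linalg.solve_triangular(L, (X - mean).T, lower=True)
    # log|cov| = 2 * sum(log(diag(L)))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    # Squared Mahalanobis distance of each sample
    maha = np.sum(Z * Z, axis=0)
    return -0.5 * (n_dim * np.log(2.0 * np.pi) + log_det + maha)
```

There is no Python loop over samples here; the only loop a GMM needs around 
this is the one over components.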

The problem with all NumPy code is that it's memory bound. That is also 
the reason the GMM code in scikit-learn doesn't scale up on multiple CPUs. 
The main optimization we could do to this code in a lower-level language 
(Cython, Fortran, C) is to remove the memory bottleneck. That would also 
allow LAPACK and BLAS to scale up the computation, without any explicit 
parallel programming in the GMM code itself.

The f2py overhead in sp.linalg.* can also be reduced with Cython. f2py is 
notorious for making transposed copies behind the scenes. That also 
happens in the GMM code here, from what I can tell. It basically 
happens whenever sp.linalg.* functions get an array which is not Fortran 
contiguous. np.dot is smart enough to hide this problem by using the 
appropriate transpose flags in BLAS, but np.linalg.* and sp.linalg.* are 
not. (They should be.)
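
A small sketch of how to check the memory layout before handing an array to 
sp.linalg.*, using only NumPy. The array names and shapes are arbitrary; 
whether a given sp.linalg function actually copies depends on the wrapper, so 
treat this as a way to inspect the layout, not as a guarantee.

```python
import numpy as np

# NumPy arrays are C (row-major) contiguous by default
X = np.ones((1000, 5))
print(X.flags.c_contiguous, X.flags.f_contiguous)

# The transpose of a C-contiguous array is a Fortran-contiguous view;
# no data is copied
Xt = X.T
print(Xt.flags.f_contiguous, np.shares_memory(X, Xt))

# An explicit, one-time conversion to Fortran order does copy
Xf = np.asfortranarray(X)
print(Xf.flags.f_contiguous, np.shares_memory(X, Xf))

# Converting an array that is already Fortran contiguous does not copy again
Xf2 = np.asfortranarray(Xf)
print(np.shares_memory(Xf, Xf2))
```

Doing the conversion once, up front, avoids paying for a hidden copy on 
every call inside a loop.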

I would also like to say (while we're at it) that parallelizing this 
outside BLAS and LAPACK, whether with threads or processes, will require 
a memory overhead roughly equal to the size of the data array per thread 
or process. That is because computation of the covariance matrix and the 
likelihood needs to produce "X - mean" in a temporary array. This 
temporary array cannot be shared but should be reused within each thread 
or process.
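
As a rough sketch of the buffer reuse described above: each worker allocates 
one scratch array the size of X and writes "X - mean" into it for every 
component, instead of allocating a new temporary each time. The function name 
scatter_matrices is hypothetical, and the weighting by responsibilities that 
a real GMM needs is left out to keep it short.

```python
import numpy as np


def scatter_matrices(X, means):
    """Per-component scatter matrix (X - mean).T @ (X - mean) / n_samples.

    Hypothetical helper: one scratch buffer is reused for all components.
    """
    n_samples, n_dim = X.shape
    # One temporary, same size as X, owned by this thread or process
    diff = np.empty((n_samples, n_dim), dtype=float)
    out = np.empty((len(means), n_dim, n_dim))
    for k, mean in enumerate(means):
        # Write X - mean into the existing buffer, no new allocation
        np.subtract(X, mean, out=diff)
        out[k] = np.dot(diff.T, diff) / n_samples
    return out
```

With several threads or processes, each would hold its own diff buffer, which 
is where the per-worker memory overhead comes from.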


Sturla




_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
