On 03/07/14 04:01, Sturla Molden wrote:
> On 02/07/14 06:02, Valerio Maggio wrote:
>
>> You were right when you said that under the hood the main `for` loop
>> iterates over the number of components, but in scikit this is
>> not done *explicitly* via Python loops.
>
> Look at lines 596 and 692.
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/gmm.py
>
By the way: This code is highly vectorized with np.dot, sp.linalg.cholesky and sp.linalg.solve_triangular. Those call into BLAS and LAPACK, which is where any parallel computing comes from (given a multithreaded BLAS/LAPACK build).

The problem with all NumPy code is that it is memory bound. That is also the reason the GMM code in scikit-learn doesn't scale up on multiple CPUs. The main optimization we could make to this code in a lower-level language (Cython, Fortran, C) is to remove the memory bottleneck. That would also allow LAPACK and BLAS to scale up the computation, without any explicit parallel programming in the GMM code itself.

The f2py overhead in sp.linalg.* can also be reduced with Cython. f2py is notorious for making transposed copies behind the scenes. It also happens in the GMM code here, from what I can tell. This basically happens whenever sp.linalg.* functions get an array which is not Fortran contiguous. np.dot is smart enough to hide this problem by using the appropriate transpose flags in BLAS, but np.linalg.* and sp.linalg.* are not. (They should be.)

I would also like to say (while we're at it) that parallelizing this outside BLAS and LAPACK, whether with threads or processes, will require a memory overhead roughly equal to the size of the data array per thread or process. That is because computing the covariance matrix and the likelihood needs to produce "X - mean" in a temporary array. This temporary array cannot be shared between threads or processes, but it should be reused within each one.

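To make the last two points concrete, here is a minimal sketch, not the scikit-learn code. It assumes a responsibility matrix `resp` of shape (n_samples, n_components) with non-negative entries, and the names `weighted_covariances`, `work` and `is_fortran_ready` are made up for illustration. It reuses one data-sized buffer for "X - mean" across components, and shows how to check whether an array is already Fortran contiguous before handing it to sp.linalg.*.

```python
import numpy as np


def weighted_covariances(X, means, resp, work=None):
    """Responsibility-weighted covariance of each component.

    `work` is a scratch array shaped like X that holds "X - mean". It is
    overwritten for every component instead of being reallocated.
    """
    n_components, n_features = means.shape
    if work is None:
        # The one data-sized temporary; each thread or process needs its own.
        work = np.empty(X.shape, dtype=np.float64)
    covs = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        np.subtract(X, means[k], out=work)  # work = X - mean_k, in place
        # Scale rows by sqrt(weight) so that work.T @ work is the weighted sum.
        np.multiply(work, np.sqrt(resp[:, k])[:, None], out=work)
        # work.T is a view; np.dot takes it without a transposed copy.
        covs[k] = np.dot(work.T, work) / resp[:, k].sum()
    return covs


def is_fortran_ready(a):
    """True if `a` is already Fortran (column-major) contiguous."""
    return bool(a.flags.f_contiguous)
```

A C-ordered 2-D array is not Fortran contiguous, but its transpose is, and np.asfortranarray returns the input unchanged when no copy is needed. In a threaded version each worker would allocate its own `work` once and pass it to every call.
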
Sturla
