Hello folks,

I've just added an n_jobs option to the pairwise_distances and
pairwise_kernels functions. This works by splitting the pairwise
matrix into n_jobs even slices and computing the slices in
parallel.
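
For the curious, the idea is roughly the following. This is only a
minimal sketch, assuming a standalone joblib and a metric function
like euclidean_distances; the helper name parallel_pairwise here is
hypothetical, and the actual code in sklearn.metrics.pairwise handles
more cases:

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.metrics import euclidean_distances

    def parallel_pairwise(X, Y, func=euclidean_distances, n_jobs=4):
        # Split Y into n_jobs roughly even row slices; each worker
        # computes one column block of the final pairwise matrix.
        slices = np.array_split(np.arange(Y.shape[0]), n_jobs)
        blocks = Parallel(n_jobs=n_jobs)(
            delayed(func)(X, Y[idx]) for idx in slices)
        # Stitch the column blocks back together (one extra copy).
        return np.hstack(blocks)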

On the USPS dataset (n_samples=7291, n_features=257), I got the
following timings (in seconds):

sparse, n_jobs=1: 30.92
sparse, n_jobs=4: 10.17

dense, n_jobs=1: 7.64
dense, n_jobs=4: 4.75

I also added a benchmark using random data in
benchmarks/bench_plot_parallel_pairwise.py. Overall, the memory
copying implied by the use of "hstack" and "Parallel" seems worth
the price, especially on larger datasets. On smaller ones, using only
one core may be slightly faster.
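
If you want a quick sanity check on your own data without running the
full benchmark script, something along these lines should do; the
sizes here are made up and the timings are illustrative, not the
numbers reported above:

    import time
    import numpy as np
    from sklearn.metrics import pairwise_distances

    X = np.random.RandomState(0).rand(5000, 257)
    for n_jobs in (1, 4):
        start = time.time()
        pairwise_distances(X, metric="euclidean", n_jobs=n_jobs)
        print("n_jobs=%d: %.2f s" % (n_jobs, time.time() - start))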

For simplicity, I prefer not to add an n_jobs option to the individual
metric functions (euclidean_distances, rbf_kernel, ...). Instead, just
use pairwise_distances(X, Y, metric="...", n_jobs=...) if you want to
do parallel computation.
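
For example (the data here is hypothetical, but the calls are as
described above):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.metrics import pairwise_distances, pairwise_kernels

    X = csr_matrix(np.random.RandomState(0).rand(1000, 257))

    # Distance matrix computed over 4 processes.
    D = pairwise_distances(X, metric="euclidean", n_jobs=4)

    # Kernels work the same way.
    K = pairwise_kernels(X, metric="rbf", n_jobs=4)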

We can now expose an n_jobs parameter in the estimators that use
pairwise metrics (neighbors, kernel PCA, ...). I'll leave that to the
maintainers of the respective modules.

Mathieu
