GitHub user mpjlu opened a pull request:
https://github.com/apache/spark/pull/18624
[SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by gemm with about
50% performance improvement
## What changes were proposed in this pull request?
In Spark 2.2, we optimized ALS recommendForAll to use a hand-written
matrix multiplication and to get the topK items for each matrix. That
method effectively reduces the GC problem. However, with a native BLAS GEMM
such as Intel MKL or OpenBLAS, matrix multiplication is about 10X faster
than the hand-written method.
I have rewritten recommendForAll with GEMM and got about a 50%
improvement compared with the recommendForAll method on master.
The key points of this optimization:
1) Use GEMM to replace the hand-written matrix multiplication.
2) Use a matrix to hold the temporary results, which largely reduces GC and
computing time. The master method creates many small objects, so using GEMM
directly in that code does not give good performance.
3) Use sort and merge to get the topK items, which avoids calling the
priority queue twice.
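The three points above can be sketched roughly as follows. This is not the code in this PR: it is a minimal illustration in Python, and every name in it (`gemm_nt`, `top_k_block`, `merge_top_k`, the toy factors) is hypothetical. In particular, `gemm_nt` is a pure-Python loop standing in for the native BLAS GEMM call, so it shows the data layout (one flat score buffer per block pair) but none of the speedup.

```python
import heapq
from itertools import islice


def gemm_nt(a, b, m, n, k):
    """Pure-Python stand-in for a native GEMM: C = A * B^T, flat row-major.

    a holds m rows of length k, b holds n rows of length k.
    """
    c = [0.0] * (m * n)  # one flat buffer instead of many small objects
    for i in range(m):
        for j in range(n):
            c[i * n + j] = sum(a[i * k + p] * b[j * k + p] for p in range(k))
    return c


def top_k_block(user_ids, user_factors, item_ids, item_factors, rank, k):
    """Score one user block against one item block; keep k best items per user."""
    m, n = len(user_ids), len(item_ids)
    scores = gemm_nt(user_factors, item_factors, m, n, rank)
    result = {}
    for i, user in enumerate(user_ids):
        row = scores[i * n:(i + 1) * n]
        # Sort item positions by descending score and keep the first k.
        order = sorted(range(n), key=lambda j: -row[j])[:k]
        result[user] = [(item_ids[j], row[j]) for j in order]
    return result


def merge_top_k(left, right, k):
    """Merge two lists already sorted by descending score; keep the first k."""
    return list(islice(heapq.merge(left, right, key=lambda t: -t[1]), k))


# Toy data (assumed for illustration): 2 users, 2 item blocks, rank = 2.
users = [0, 1]
user_factors = [1.0, 0.0,   0.0, 1.0]
block_a = ([10, 11], [1.0, 0.0,   0.0, 2.0])
block_b = ([12, 13], [3.0, 1.0,   0.5, 0.5])

top_a = top_k_block(users, user_factors, *block_a, rank=2, k=2)
top_b = top_k_block(users, user_factors, *block_b, rank=2, k=2)
merged = {u: merge_top_k(top_a[u], top_b[u], 2) for u in users}
print(merged)
```

Each block pair yields a per-user list that is already sorted, so combining blocks is a plain merge of sorted lists rather than a second pass through a priority queue.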
Test result:
479818 users, 13727 products, rank = 10, topK = 20.
3 workers, each with 35 cores. Native BLAS is Intel MKL.

| Block size    | 1000  | 2000  | 4000  | 8000  |
|---------------|-------|-------|-------|-------|
| Master method | 40s   | 39.4s | 39.5s | 39.1s |
| This method   | 26.5s | 25.9s | 26s   | 27.1s |
Performance Improvement: (OldTime - NewTime)/NewTime = about 50%
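As a quick check of that figure, the formula can be applied to each column of the timings reported above. The snippet below is only a worked calculation; the variable and function names are made up for illustration.

```python
# Timings in seconds, copied from the test result above.
block_sizes = [1000, 2000, 4000, 8000]
master_s = [40.0, 39.4, 39.5, 39.1]
gemm_s = [26.5, 25.9, 26.0, 27.1]


def improvement(old, new):
    """Improvement as defined in this PR: (OldTime - NewTime) / NewTime."""
    return (old - new) / new


def time_saved(old, new):
    """Same gap expressed relative to the old time: (OldTime - NewTime) / OldTime."""
    return (old - new) / old


improvements = [improvement(o, n) for o, n in zip(master_s, gemm_s)]
savings = [time_saved(o, n) for o, n in zip(master_s, gemm_s)]

for size, imp, sav in zip(block_sizes, improvements, savings):
    print(f"block size {size}: improvement {imp:.1%}, time saved {sav:.1%}")
```

With these numbers the improvement ranges from roughly 44% (block size 8000) to roughly 52% (block size 2000). Measured against the old time instead, the same gap is a reduction of roughly 31% to 34% in wall-clock time.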
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mpjlu/spark OptimizeAlsByGEMM
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18624.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18624
----
commit 5ca3fd1d8e9d5fa6ae0daf24c83e72ef96045104
Author: Peng Meng <[email protected]>
Date: 2017-07-13T07:33:45Z
add poll for PriorityQueue
commit 215efc3114012ebc19af984a3d0172aecb22f255
Author: Peng Meng <[email protected]>
Date: 2017-07-13T10:39:44Z
test pass
commit 7c587f4070c0951425d1686429816feb712c0273
Author: Peng Meng <[email protected]>
Date: 2017-07-13T11:08:41Z
fix bug
commit e8a40edb25db8a6ecdfe67bd54f38071e7a99781
Author: Peng Meng <[email protected]>
Date: 2017-07-13T11:56:50Z
code style change
----