Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
For performance tests, I've been using the MovieLens `ml-latest` dataset
[here](https://grouplens.org/datasets/movielens/). It has `24,404,096` ratings
with `259,137` users and `39,443` movies.
So it's not enormous, but "recommend all" does a lot of work: it generates
`1,631,206,099` raw predicted ratings before the `top-k` selection.
A quick test of the existing `recommendProductsForUsers` gives `306 sec`:
```
scala> spark.time { oldModel.recommendProductsForUsers(k).count }
Time taken: 306512 ms
res11: Long = 259137
```
As part of my performance testing I've tried a few approaches roughly
similar to this PR, but using `Window` and `filter` rather than this top-k
aggregator (which is a neat idea).
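For readers unfamiliar with the `Window` and `filter` pattern, here is a minimal sketch of what such a top-k might look like. It is not the code I actually benchmarked: the toy `scores` data, the column names and the local `SparkSession` are assumptions for illustration, and it needs `spark-sql` on the classpath (or paste it into `spark-shell`).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("window-topk-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data standing in for the full user x item score matrix.
val scores = Seq(
  (1, 10, 4.5), (1, 11, 3.0), (1, 12, 5.0),
  (2, 10, 2.0), (2, 11, 4.0), (2, 12, 1.0)
).toDF("userId", "movieId", "rating")

val k = 2

// Rank each user's predictions by descending score, then keep the first k.
val byUser = Window.partitionBy("userId").orderBy(col("rating").desc)
val topK = scores
  .withColumn("rank", row_number().over(byUser))
  .filter(col("rank") <= k)
  .drop("rank")

topK.show()
```

Note that the window has to sort every user's full set of predictions before the filter runs, whereas a top-k aggregator only needs to hold `k` candidates per user at a time.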
At first I thought this PR was really good:
```
scala> spark.time { newModel.recommendForAllUsers(k).count }
Time taken: 151504 ms
res3: Long = 259137
```
`151 sec` seems fast!
But then I tried this:
```
scala> spark.time { newModel.recommendForAllUsers(k).show }
+------+--------------------+
|userId|     recommendations|
+------+--------------------+
| 35982|[[131382,15.53116...|
| 67782|[[131382,29.72169...|
| 82672|[[132954,12.19152...|
|155042|[[148954,16.09084...|
|167532|[[118942,13.94282...|
|168802|[[27212,11.881494...|
|216112|[[109159,25.46359...|
|243392|[[153010,9.85302]...|
|255132|[[131382,15.50626...|
|255362|[[131382,10.08476...|
| 17389|[[152711,16.09958...|
|120899|[[156956,12.61003...|
|213089|[[82055,13.293286...|
|253769|[[152711,16.57459...|
|258129|[[152711,22.50499...|
| 24347|[[152711,12.31282...|
| 35947|[[153184,11.04110...|
|103357|[[132954,13.26898...|
|130557|[[118942,14.00168...|
|156017|[[153010,12.24449...|
+------+--------------------+
only showing top 20 rows
Time taken: 672524 ms
```
`672 sec`, over 2x slower than the `mllib` implementation.
Not sure why `count` is so much faster than `show` (maybe with `count` Spark
SQL skips part of the actual computation, while `show` has to do all of it?).
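One way to probe that hypothesis is sketched below, under assumptions: `recs` is a hypothetical stand-in for the output of `recommendForAllUsers(k)`, built from toy data, and the session is local. The idea is to compare the plan for a global count with the plan for the full result, then time an action that has to touch the `recommendations` column of every row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("timing-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the DataFrame returned by recommendForAllUsers(k).
val recs = Seq((1, 12, 5.0), (1, 10, 4.5), (2, 11, 4.0), (2, 10, 2.0))
  .toDF("userId", "movieId", "rating")
  .groupBy("userId")
  .agg(collect_list(struct(col("movieId"), col("rating"))).as("recommendations"))

// Plan for producing the full rows, including the recommendations column.
recs.explain(true)

// Plan for a global count aggregate; the optimizer is free to prune
// columns that the count does not need.
recs.groupBy().count().explain(true)

// Time an action that reads the recommendations of every row, so the
// column cannot be skipped.
val totalRecs = spark.time {
  recs.rdd.map(_.getSeq[Any](1).size).reduce(_ + _)
}
```

If the optimized plan for the count no longer contains the aggregation that builds `recommendations`, that would be consistent with `count` under-reporting the real cost, though it would not by itself explain why `show` is slower than the `mllib` path.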