Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
I should note that I've found the performance of "recommend all" to be very dependent on the number of partitions, since that controls the trade-off between per-task memory consumption (which can easily explode in the blocked `mllib` version) and CPU utilization & the amount of shuffle data.
For example, the default `mllib` results above use `192*192 = 36,864` partitions (due to the cross-join). That does prevent the job from dying from exploding memory & GC, but it is slower than using fewer partitions. With too few partitions, however, it dies.
I actually just realised that the `mllib` default for user/item blocks - which in turn controls the partitioning of the factors - is `defaultParallelism` (192 for my setup), while for `ml` it is `10`. Hence we need to create a like-for-like comparison.
(Side note - it's not ideal that the number of blocks drives the recommend-all partitioning, because the optimal settings for training ALS are unlikely to be optimal for batch recommend-all prediction.)
Anyway, some results of quick tests on my setup.
Firstly, to match the `mllib` defaults:
`mllib` with 192*192: `323 sec`
```
scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit) }
Time taken: 323367 ms
```
`ml` with 192*192: `427 sec`
```
scala> val newModel = newAls.setNumUserBlocks(192).setNumItemBlocks(192).fit(ratings)
scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 427174 ms
```
So this PR is about 30% slower - which is actually pretty decent given it's not using blocked BLAS operations.
*Note*: I didn't use netlib native BLAS, which could make a large difference for the level 3 BLAS calls in the blocked `mllib` version.
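For anyone who wants to repeat this with native BLAS enabled: the usual netlib-java setup looks roughly like the following (a sketch, assuming a Debian-style box; the package name and version are illustrative of the Spark 2.x era, so check your own environment):

```shell
# Install a system BLAS implementation first (e.g. OpenBLAS on Debian/Ubuntu):
sudo apt-get install libopenblas-base

# Then pull in the netlib-java native glue on the classpath when launching:
./bin/spark-shell --packages com.github.fommil.netlib:all:1.1.2
```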
Secondly, to match `ml` defaults:
`mllib` with 10*10: `1654 sec`
```
scala> val oldModel = OldALS.train(oldRatings, rank, iter, lambda, 10)
scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit) }
Time taken: 1654951 ms
```
`ml` with 10*10: `438 sec`
```
scala> val newModel = newAls.fit(ratings)
scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 438328 ms
```
In this case, the `mllib` version blows up, with memory & GC dominating the runtime, and this PR is over 3x faster (though it varies a lot: 600 sec above, 438 sec here, etc.).
Finally, middle of the road case:
`mllib` with 96*96: `175 sec`
```
scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit) }
Time taken: 175880 ms
```
`ml` with 96*96: `181 sec`
```
scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 181494 ms
```
So a few % slower - again pretty good considering it's not a blocked implementation, and there is still room to optimize.
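To make the "non-blocked" part concrete, here is a minimal self-contained sketch (toy data and illustrative names, not the PR's actual code) of what the per-row approach computes: a dot product for every (user, item) pair, keeping the top k per user. The blocked `mllib` version does the same work but multiplies whole factor blocks at once with level 3 BLAS.

```scala
val rank = 2
// Toy factor maps: id -> latent factor vector of length `rank`.
val userFactors = Map(1 -> Array(1.0, 0.0), 2 -> Array(0.0, 1.0))
val itemFactors = Map(10 -> Array(2.0, 0.0), 11 -> Array(0.0, 3.0), 12 -> Array(1.0, 1.0))
val k = 2

// Plain dot product between two factor vectors.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Score every item for every user and keep the k highest-scoring items.
val topK: Map[Int, Seq[(Int, Double)]] = userFactors.map { case (u, uf) =>
  u -> itemFactors.toSeq
    .map { case (i, vf) => (i, dot(uf, vf)) }
    .sortBy(-_._2)
    .take(k)
}
// e.g. user 1 -> Seq((10, 2.0), (12, 1.0))
```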
After running these I tested a blocked version using `DataFrame` (to more or less match the current `mllib` implementation): it's much faster in the `192*192` case, a bit slower in the `96*96` case, and also blows up in the `10*10` case. Again, really dependent on partitioning...
So the performance here is not too bad, and the upside is that it should avoid blowing up completely. As I mentioned above, I tried a similar DataFrame-based version using `Window` & `filter` and its performance was terrible. It will be interesting to see whether native BLAS adds anything.
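For reference, the `Window` & `filter` shape I mean is roughly the following (a hedged sketch with a toy scored cross-join and illustrative column names, not the exact code I benchmarked):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

val spark = SparkSession.builder.master("local[2]").appName("windowTopK").getOrCreate()
import spark.implicits._

// Toy scored cross-join of (userId, itemId, score); names are illustrative.
val scored = Seq(
  (1, 10, 2.0), (1, 11, 0.5), (1, 12, 1.0),
  (2, 10, 0.1), (2, 11, 3.0), (2, 12, 1.0)
).toDF("userId", "itemId", "score")

val k = 2
// Rank items per user by descending score, then keep only the top k rows.
val byUser = Window.partitionBy("userId").orderBy(desc("score"))
val topK = scored
  .withColumn("rank", row_number().over(byUser))
  .filter(col("rank") <= k)
  .drop("rank")
topK.show()
```

The window forces a full sort of every user's scored items within a single partition per window group, which is one plausible reason it performed so badly at this scale.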