Github user jtengyp commented on the issue:
https://github.com/apache/spark/pull/17742
I ran some tests with the PR.
Here is the cluster configuration:
3 workers, each with 10 cores and 30G memory.
With the Netflix dataset (480,189 users and 17,770 movies), the
recommendProductsForUsers time drops from 488.36s to 60.93s, about 8x faster
than the original method.
With a larger dataset (3.29 million users and 0.21 million products), the
recommendProductsForUsers time drops from 48h to 39min, about 73x faster than
the original method.
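
As a quick sanity check on the ratios quoted above, the arithmetic can be sketched in a few lines of Python. The helper name `speedup` is made up for illustration and is not part of the PR; the 48h and 39min figures are taken as reported, so the second ratio is only approximate.

```python
def speedup(before_s, after_s):
    """Return how many times faster the new timing is than the old one."""
    return before_s / after_s

# Timings as reported in the comment, converted to seconds.
netflix = speedup(488.36, 60.93)          # Netflix dataset
large = speedup(48 * 3600, 39 * 60)       # larger dataset: 48h vs 39min

print(f"netflix: {netflix:.1f}x")  # prints "netflix: 8.0x"
print(f"large:   {large:.1f}x")    # prints "large:   73.8x"
```

The second ratio comes out just under 74, which matches the quoted 73x when rounded down.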