GitHub user gaborhermann commented on the issue:
https://github.com/apache/flink/pull/2819
There are some open questions:
1. Should we optimize the 3-way join? For now the join order is hard-coded, and
we might also be able to give hints for join strategies.
2. How should we handle empty blocks? When matching a rating block with the
current factor blocks, there might be no rating block or no factor block with
that ID, because the rating block corresponds to a different user and item
block at every iteration. For now we join the blocks with a `coGroup`, which is
essentially a full outer join, because we need to change the rating block
ID for every factor block at each iteration. This might not be the best
solution (see the comments at `coGroup`), but I don't see a better one right now.
3. The number of blocks also determines the number of iterations, so a
higher number of blocks degrades performance. We ran experiments
on a cluster that show this:
see the [plot for movielens
data](https://s18.postimg.org/txap3x9o9/movielens_blocks.png) and the [plot for lfm_1b
data](https://s11.postimg.org/ysnonuer7/lfm1b_blocks.png). Based on this, we
recommend setting the number of blocks to the smallest value for which the
blocks fit into memory (and at least the parallelism of the execution). There
might be a way to avoid this and split the computation into more blocks while
doing the same number of iterations, but it's not trivial because of the possibly
conflicting user-item blocks (which is why the paper uses this blocking in the
first place). Should we investigate this further? With the recommended settings
(and given enough memory) the algorithm performs well (see the plots).
4. The test data is made by hand to ensure that changes to the code do not
change the algorithm's behavior. The algorithm produces good results on real
data. The question is whether we should build a more thorough testing mechanism
for matrix factorization (as proposed in the [PR for
iALS](https://github.com/apache/flink/pull/2542)), or whether this kind of
testing is sufficient?
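
To make the empty-block case in question 2 concrete, here is a minimal sketch in plain Python of the full-outer-join behavior a `coGroup` gives. It does not use the Flink API; the function `co_group`, the block IDs, and the placeholder payloads are all hypothetical and only illustrate that a key present on one side still produces a group, with an empty list on the other side.

```python
from collections import defaultdict


def co_group(left, right):
    """Group two keyed collections by key, covering keys from either side.

    Yields (key, left_values, right_values). A side without a match for
    a key contributes an empty list, which is the "empty block" case.
    """
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for key, value in left:
        left_groups[key].append(value)
    for key, value in right:
        right_groups[key].append(value)
    # Union of keys gives full-outer-join semantics.
    for key in sorted(left_groups.keys() | right_groups.keys()):
        yield key, left_groups.get(key, []), right_groups.get(key, [])


# Hypothetical data: (block ID, payload) pairs, not real rating or factor blocks.
rating_blocks = [(0, "ratings-0"), (2, "ratings-2")]
factor_blocks = [(0, "factors-0"), (1, "factors-1")]

matched = list(co_group(rating_blocks, factor_blocks))
for block_id, ratings, factors in matched:
    print(block_id, ratings, factors)
```

In this sketch, block 1 has factors but no ratings and block 2 has ratings but no factors, so the function handling each group has to deal with an empty side explicitly.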
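
The recommendation in question 3 can be sketched as a small helper. This is only an illustration under a strong simplifying assumption: that a block's size is roughly the total data size divided by the number of blocks, and that "fits into memory" means a block stays within a fixed per-block budget. The function name and all parameters are hypothetical and are not part of the PR.

```python
def recommended_num_blocks(data_bytes, memory_per_block_bytes, parallelism):
    """Smallest block count whose blocks fit in memory, but at least the parallelism.

    Assumes blocks are evenly sized (data_bytes / number of blocks),
    which is a simplification made for this sketch.
    """
    if data_bytes < 0 or memory_per_block_bytes <= 0 or parallelism <= 0:
        raise ValueError("sizes must be non-negative and budgets/parallelism positive")
    # Integer ceiling division: fewest evenly sized blocks within the memory budget.
    blocks_for_memory = (data_bytes + memory_per_block_bytes - 1) // memory_per_block_bytes
    return max(parallelism, blocks_for_memory)


# Memory-bound case: 10,000 bytes with a 3,000-byte budget needs 4 blocks.
print(recommended_num_blocks(10_000, 3_000, parallelism=2))
# Parallelism-bound case: the data fits in one block, but 8 workers need 8 blocks.
print(recommended_num_blocks(1_000, 3_000, parallelism=8))
```

Any block count above this value would, according to the experiments referenced above, add iterations without a memory benefit.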