[ https://issues.apache.org/jira/browse/FLINK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15689805#comment-15689805 ]
ASF GitHub Bot commented on FLINK-4961:
---------------------------------------
Github user thvasilo commented on the issue:
https://github.com/apache/flink/pull/2819
Hello @gaborhermann,
I really like the idea of introducing a `MatrixFactorization` interface
that we can then reuse for different specialized optimization algorithms; I put a
rough sketch of what I mean after my comments below. As for your open questions,
I'm afraid I can't be of much help right away; I'll read the relevant paper this
week and get back to you if I have any more comments.
1. Don't know enough about joins to answer this :/
2. For this we would need to test the two solutions you have proposed and
evaluate their performance. You have listed a couple of pros/cons there; maybe
you could elaborate?
3. If there is absolutely no benefit from using more blocks, I don't see the
need to investigate further. We'll need to include these instructions in the
docs.
4. I think test hardening should be done in a separate PR (potentially
covering more algorithms). For now, manual tests should suffice.
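To make the interface idea a bit more concrete: the trait name below comes from your proposal, but every member in this sketch is just my assumption, not code from this PR:
```scala
// Rough sketch only: the members here are assumptions, not the actual PR code.
trait MatrixFactorization {

  /** Rank of the factorization, i.e. the length of the factor vectors. */
  def numFactors: Int

  /** Learn user and item factors from (userID, itemID, rating) triples. */
  def factorize(ratings: Seq[(Int, Int, Double)])
    : (Map[Int, Array[Double]], Map[Int, Array[Double]])

  /** Predict a rating as the dot product of the two factor vectors. */
  def predict(userFactor: Array[Double], itemFactor: Array[Double]): Double =
    userFactor.iterator.zip(itemFactor.iterator).map { case (u, i) => u * i }.sum
}
```
A specialized solver like the DSGD implementation here would then be one implementation of this trait, and ALS could be another.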
> SGD for Matrix Factorization
> ----------------------------
>
> Key: FLINK-4961
> URL: https://issues.apache.org/jira/browse/FLINK-4961
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Gábor Hermann
> Assignee: Gábor Hermann
>
> We have started an implementation of distributed stochastic gradient descent
> for matrix factorization based on Gemulla et al. [1].
> The main problem with distributed SGD in general is conflicting updates to
> the model variables. In the case of matrix factorization we can avoid
> conflicting updates by carefully deciding in each iteration step which blocks
> of the rating matrix we should use to update the corresponding blocks of the
> user and item matrices (see Figure 1 in the paper).
> Although a general SGD solver might seem relevant for this issue, we can do
> much better in the special case of matrix factorization. In the case of e.g. a
> linear regression model, the model is small, so it is broadcast in every
> iteration and conflicts can only be avoided by keeping a single "global"
> model. In matrix factorization the model (the user and item matrices) is
> large, so we avoid conflicts by blocking instead. The general SGD solver is
> therefore a separate issue.
> To give more details, the algorithm works as follows.
> We randomly create user and item vectors, then randomly partition them into
> {{k}} user and {{k}} item blocks. Based on these factor blocks we partition
> the rating matrix into {{k * k}} blocks correspondingly.
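> As an illustration, the blocking might be sketched like this in plain Scala
> (illustrative names only; the actual implementation works on partitioned
> Flink DataSets):
> {code:scala}
> // Sketch of the blocking scheme; k is the number of user (and item) blocks.
> val k = 4
>
> def userBlockOf(userId: Int): Int = ((userId % k) + k) % k
> def itemBlockOf(itemId: Int): Int = ((itemId % k) + k) % k
>
> // Every rating falls into exactly one of the k * k rating blocks,
> // identified by the pair (user block, item block).
> def ratingBlockOf(userId: Int, itemId: Int): (Int, Int) =
>   (userBlockOf(userId), itemBlockOf(itemId))
> {code}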
> In one iteration step we choose {{k}} non-conflicting rating blocks, i.e. we
> never choose two rating blocks that share a user or an item block. This is
> done by assigning a rating block ID to every user and item block. We match
> the user, item, and rating blocks by the current rating block ID, and update
> the user and item factors locally with the matched ratings. We then update
> the rating block IDs of the factor blocks, so that the next iteration step
> uses other rating blocks to update the factors.
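> The non-conflicting choice can be pictured as picking one "diagonal" of the
> blocked rating matrix per step, e.g. (sketch, not the actual code):
> {code:scala}
> // In sub-step s, pick rating blocks (i, (i + s) % k) for i = 0 .. k-1.
> // All user block indices are distinct, and so are all item block indices,
> // so no two chosen rating blocks conflict.
> def blocksForSubstep(k: Int, s: Int): Seq[(Int, Int)] =
>   (0 until k).map(i => (i, (i + s) % k))
>
> // E.g. for k = 3: sub-step 0 -> (0,0), (1,1), (2,2);
> //                 sub-step 1 -> (0,1), (1,2), (2,0).
> {code}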
> In {{k}} iteration steps we sweep through the whole rating matrix of
> {{k * k}} blocks (so instead of {{numberOfIterationSteps}} iteration steps we
> should do {{k * numberOfIterationSteps}}).
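> Continuing the sketch above, the outer loop would then be (illustrative
> only):
> {code:scala}
> // numberOfIterationSteps full sweeps over the k * k rating blocks
> // take k * numberOfIterationSteps sub-steps in total.
> val numberOfIterationSteps = 10
> for (step <- 0 until k * numberOfIterationSteps) {
>   val s = step % k // which "diagonal" of rating blocks to process
>   for ((userBlock, itemBlock) <- blocksForSubstep(k, s)) {
>     // run SGD locally on rating block (userBlock, itemBlock), updating
>     // the factors of user block userBlock and item block itemBlock
>   }
> }
> {code}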
> [1] [http://people.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf]