[ 
https://issues.apache.org/jira/browse/FLINK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15650855#comment-15650855
 ] 

ASF GitHub Bot commented on FLINK-4613:
---------------------------------------

Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Hi @thvasilo,
    
    Thanks for your thoughts! I agree we should perform a benchmark in the 
future. Furthermore, based on the results we could optimize the algorithm.
    
    I split up the test and rebased onto the current master. I checked the 
`java.Iterable` again and commented on your original concern. I am afraid 
we'll have to use the `java.Iterable`.
    
    Regarding the expected results, I've only generated the small input data by 
hand. Before that, I checked whether the Spark and Flink implementations 
converged to approximately the same factor matrices (I only compared the value 
of the objective function, not the whole matrices). Because of the random 
initialization we cannot guarantee identical results, but there were 2-3 
points that both Spark and Flink converged to.
    
    There might be better methods for testing, but I considered this sufficient 
as the original `ALSITSuite` did nothing more. Of course, this test only checks 
whether the algorithm still behaves the same way after some modifications (e.g. 
optimization), and does not check whether the algorithm was correct in the 
first place, but the same is true of the test for the original ALS. Do you know 
what assurance we have that the explicit ALS works correctly? (It must be 
correct, as I also checked the results of the explicit ALS against Spark on toy 
data.) AFAIK Spark generates random matrices of known rank, factorizes them, 
and checks whether the error is low (see their 
[ALSSuite](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/recommendation/ALSSuite.scala)).
 In the future, it might be worth following their approach.
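
The known-rank testing idea mentioned above can be sketched roughly as follows. This is only an illustration in plain Python, not the Spark or Flink test code: the helper names (`random_low_rank`, `als`, `rmse`), the matrix sizes, the regularization value, and the error threshold are all assumptions, and for simplicity every entry of the matrix is treated as observed, whereas a real suite would likely sample only part of the ratings.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def update_factors(ratings, fixed, rank, reg):
    """One ALS half-step: solve regularized least squares for every row of `ratings`."""
    # With all entries observed, fixed^T fixed + reg * I is the same for every row.
    gram = [[sum(f[i] * f[j] for f in fixed) + (reg if i == j else 0.0)
             for j in range(rank)] for i in range(rank)]
    result = []
    for row in ratings:
        rhs = [sum(f[i] * r for f, r in zip(fixed, row)) for i in range(rank)]
        result.append(solve(gram, rhs))
    return result


def als(ratings, rank, reg=1e-6, iterations=10, seed=0):
    """Explicit-feedback ALS on a fully observed matrix (simplifying assumption)."""
    rng = random.Random(seed)
    items = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(len(ratings[0]))]
    ratings_t = [list(col) for col in zip(*ratings)]
    for _ in range(iterations):
        users = update_factors(ratings, items, rank, reg)
        items = update_factors(ratings_t, users, rank, reg)
    return users, items


def rmse(ratings, users, items):
    """Root mean squared error between `ratings` and the product of the factors."""
    total, count = 0.0, 0
    for u, row in zip(users, ratings):
        for v, r in zip(items, row):
            total += (sum(a * b for a, b in zip(u, v)) - r) ** 2
            count += 1
    return (total / count) ** 0.5


def random_low_rank(n_users, n_items, rank, seed):
    """Build a matrix of known rank as the product of two random factor matrices."""
    rng = random.Random(seed)
    u = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_users)]
    v = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_items)]
    return [[sum(a * b for a, b in zip(ur, vr)) for vr in v] for ur in u]


ratings = random_low_rank(20, 15, rank=3, seed=42)
users, items = als(ratings, rank=3)
error = rmse(ratings, users, items)
print(f"RMSE: {error:.2e}")
assert error < 1e-3  # illustrative threshold, not taken from any existing suite
```

Because the input has exactly the rank that is being fitted, a correct implementation should reconstruct it almost perfectly, so the check does not depend on which local optimum the random initialization leads to.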


> Extend ALS to handle implicit feedback datasets
> -----------------------------------------------
>
>                 Key: FLINK-4613
>                 URL: https://issues.apache.org/jira/browse/FLINK-4613
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Gábor Hermann
>            Assignee: Gábor Hermann
>
> The Alternating Least Squares implementation should be extended to handle 
> _implicit feedback_ datasets. These datasets do not contain explicit ratings 
> by users; rather, they are built by collecting user behavior (e.g. user 
> listened to artist X for Y minutes), and they require a slightly different 
> optimization objective. For details, see [Hu et 
> al|http://dx.doi.org/10.1109/ICDM.2008.22].
> We do not need to modify much in the original ALS algorithm. See [Spark ALS 
> implementation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala],
>  which could be a basis for this extension. Only the factor update step is 
> modified, and most of the changes are in the local parts of the algorithm 
> (i.e. UDFs). In fact, the only modification that is not local is 
> precomputing the matrix product Y^T * Y and broadcasting it to all the nodes, 
> which we can do with broadcast DataSets.
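
To illustrate why precomputing Y^T * Y helps, here is a minimal single-process sketch in plain Python. It assumes the confidence weighting from Hu et al., c_ui = 1 + alpha * r_ui, so that unobserved items have confidence 1 and Y^T C^u Y = Y^T Y + sum over observed items of (c_ui - 1) * y_i y_i^T. The function names, the value of `alpha`, and the toy data are hypothetical, and nothing here uses the Flink API; in Flink the shared `yty` matrix would be the broadcast part.

```python
import random


def add_outer(acc, y, weight):
    """In place: acc += weight * (y y^T)."""
    for i, yi in enumerate(y):
        for j, yj in enumerate(y):
            acc[i][j] += weight * yi * yj


def gram(items):
    """Y^T Y, computed once per iteration and shared by all users."""
    rank = len(items[0])
    acc = [[0.0] * rank for _ in range(rank)]
    for y in items:
        add_outer(acc, y, 1.0)
    return acc


def weighted_gram_naive(items, observed):
    """Y^T C^u Y by iterating over every item (confidence 1 if unobserved)."""
    rank = len(items[0])
    acc = [[0.0] * rank for _ in range(rank)]
    for i, y in enumerate(items):
        add_outer(acc, y, observed.get(i, 1.0))
    return acc


def weighted_gram_fast(yty, items, observed):
    """Y^T C^u Y from the shared Y^T Y plus a correction for observed items only."""
    acc = [row[:] for row in yty]
    for i, c in observed.items():
        add_outer(acc, items[i], c - 1.0)
    return acc


rng = random.Random(7)
items = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]  # toy item factors Y

alpha = 40.0  # illustrative confidence scaling, an assumption
raw_feedback = {4: 2.0, 17: 0.5, 33: 1.0}  # hypothetical item index -> r_ui for one user
observed = {i: 1.0 + alpha * r for i, r in raw_feedback.items()}  # c_ui

yty = gram(items)
naive = weighted_gram_naive(items, observed)
fast = weighted_gram_fast(yty, items, observed)
max_diff = max(abs(a - b) for ra, rb in zip(naive, fast) for a, b in zip(ra, rb))
assert max_diff < 1e-9
```

The per-user work in `weighted_gram_fast` depends only on the number of items that user interacted with, which is why the shared product is the only non-local piece.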



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
