[ 
https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226562#comment-14226562
 ] 

Jacky Li commented on SPARK-4001:
---------------------------------

Thanks for your suggestion, Daniel. 
Here is the current status.
1. I have implemented Apriori and FP-growth, following YAFIM 
(http://pasa-bigdata.nju.edu.cn/people/ronggu/pub/YAFIM_ParLearning.pdf) and 
PFP (http://dl.acm.org/citation.cfm?id=1454027). 
For Apriori there are currently two versions: one uses a broadcast 
variable and the other uses a Cartesian join of two RDDs. I am testing both 
on the mushroom and webdoc open datasets (http://fimi.ua.ac.be/data/) to compare 
their performance before deciding which one to contribute to MLlib.
I have updated the code in the PR (https://github.com/apache/spark/pull/2847); 
you are welcome to check it and try it in your use case.
2. For the input, the Apriori implementation currently takes 
{{RDD\[Array\[String\]\]}} as the input dataset, with no basket_id 
or user_id. I am not sure whether that fits your use case easily. Can you 
give more detail on how you want to use it in a collaborative filtering context? 
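
To make the input shape concrete, here is a minimal single-machine sketch of level-wise Apriori over baskets that carry no ids. It is plain Python rather than Spark and is not the code in the PR; the names {{group_into_baskets}} and {{apriori}}, and the absolute-count {{min_support}} parameter, are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def group_into_baskets(pairs):
    """Group (basket_id, item) pairs into item lists, dropping the ids.

    Hypothetical helper: shows one way id-keyed data could be reduced
    to the id-free "list of item arrays" shape discussed above.
    """
    baskets = defaultdict(set)
    for basket_id, item in pairs:
        baskets[basket_id].add(item)
    return [sorted(items) for items in baskets.values()]


def apriori(transactions, min_support):
    """Return {frozenset(itemset): count} for every itemset that occurs
    in at least `min_support` transactions (absolute count)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets and keep only
        # k-sized unions whose (k-1)-subsets are all frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Support counting: one pass over the transactions per level.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

Calling {{apriori(group_into_baskets(pairs), min_support)}} on (basket_id, item) pairs would be one way to feed id-keyed data into an id-free interface. In a distributed version, the support-counting pass is the step where the broadcast and join variants would differ; the sketch does not model that.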


> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent itemset mining in a 
> transactional data set. It would be useful to add the Apriori algorithm to 
> MLlib in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
