[
https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226562#comment-14226562
]
Jacky Li commented on SPARK-4001:
---------------------------------
Thanks for your suggestion, Daniel.
Here is the current status.
1. I have implemented Apriori and FP-Growth, following YAFIM
(http://pasa-bigdata.nju.edu.cn/people/ronggu/pub/YAFIM_ParLearning.pdf) and
PFP (http://dl.acm.org/citation.cfm?id=1454027) respectively.
For Apriori, there are currently two versions: one uses a broadcast variable
and the other uses a Cartesian join of two RDDs. I am testing both on the
mushroom and webdocs open datasets (http://fimi.ua.ac.be/data/) to compare
their performance before deciding which one to contribute to MLlib.
I have updated the code in the PR (https://github.com/apache/spark/pull/2847);
you are welcome to review it and try it on your use case.
2. For the input, the Apriori implementation currently takes
{{RDD\[Array\[String\]\]}} as the input dataset, which does not include a
basket_id or user_id. I am not sure whether that fits your use case easily. Can
you give more detail on how you want to use it in a collaborative filtering
context?
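
To make the input question concrete, here is a minimal sketch in plain Python
(not Spark, and not the code in the PR). It assumes the data arrives as
(basket_id, item) pairs, groups them into one item set per basket, and then
runs a level-wise Apriori pass over the baskets. The names group_baskets,
apriori and min_support are hypothetical and chosen only for this
illustration.

```python
from collections import defaultdict
from itertools import combinations


def group_baskets(rows):
    """Group (basket_id, item) pairs into one item set per basket (hypothetical helper)."""
    baskets = defaultdict(set)
    for basket_id, item in rows:
        baskets[basket_id].add(item)
    return list(baskets.values())


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets and
        # keep only those whose (k-1)-subsets are all frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Counting pass: one scan of the transactions per level.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

With this shape, a dataset keyed by basket_id or user_id only needs the
grouping step before it matches the item-array-per-transaction input. In a
distributed version the counting pass is the part that would be spread over
the cluster, for example by broadcasting the candidate sets to the workers.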
> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
> Key: SPARK-4001
> URL: https://issues.apache.org/jira/browse/SPARK-4001
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Jacky Li
> Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent itemset mining in a
> transactional data set. It would be useful to add the Apriori algorithm to
> MLlib in Spark.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)