[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib

wangfei (JIRA) Tue, 21 Oct 2014 02:32:59 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178183#comment-14178183
 ]


wangfei commented on SPARK-4001:
--------------------------------



Thanks Sean Owen for explaining! Frequent itemset algorithm works by scanning 
the input data set, there is no probabilistic model in nature. 

To answer Xiangrui Meng’s earlier questions:
1.      These algorithm is used for finding major patterns / association rules 
in a data set. For a real use case, some analytic applications in telecom 
domain use them to find subscriber behavior from the data set combining service 
record, network traffic record, and demographic data. Please refer to this 
Chinese article for example:  http://www.ss-lw.com/wxxw-361.html
And, sometimes we use frequent itemset algorithm for preparing features input 
to other algorithm which selects feature and do other ML task like training a 
classifier, like this paper: http://dl.acm.org/citation.cfm?id=1401922, 

2.      Since Apriori is a basic algorithm for frequent itemset mining, I am 
not aware of any parallel implementation for it. But I think the algorithm fits 
Spark’s data parallel model since it only need to scan the input data set. And 
for FP-Growth, I do know there is a Parallel FP-Growth from Haoyuan Li: 
http://dl.acm.org/citation.cfm?id=1454027  . I think I probably will refer  to 
this paper to implement FP-Growth in Spark 

3.      The Apriori computation complexity is about O(N*k) where N is the 
number of item in input data and k is the depth of the frequent item tree to 
search. FP-Grwoth complexity is about O(N),  it is more efficient comparing to 
Apriori. For space efficiency, FP-growth is also more efficient than Apriori. 
But in case of smaller data and if frequent itemset is more, Apriori is more 
efficient. This is because FP-Growth need to construct a FP Tree out of the 
input data set, and it needs some time. And another advantage of Apriori is 
that it can output association rules while FP-Growth can not.

Although these two algorithms are basic algo (FP-Growth is more complex), I 
think it will be handy if mllib can include them since there is no frequent 
itemset mining algo in Spark yet, and especially in distributed environment. 
Please suggest how to handle this issue. Thanks a lot. 


> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent item set mining in a 
> transactional data set.  It will be useful if Apriori algorithm is added to 
> MLLib in Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib

Reply via email to