[ 
https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226793#comment-14226793
 ] 

Daniel Erenrich commented on SPARK-4001:
----------------------------------------

First, a minor point: I'd suggest having it take an RDD of arrays of ints. 
Holding tons of strings around seems wasteful, and the user can maintain a map 
from strings to ints. Or else, can we just make the array contain "comparables"?
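
As a rough sketch of the user-side mapping I have in mind (plain Python rather 
than Spark, and encode_baskets is just a hypothetical helper name, not an 
existing API):

```python
def encode_baskets(baskets):
    """Map each distinct string item to a small int and encode every basket."""
    item_to_id = {}
    encoded = []
    for basket in baskets:
        ids = []
        for item in basket:
            # Assign the next unused id the first time an item is seen.
            if item not in item_to_id:
                item_to_id[item] = len(item_to_id)
            ids.append(item_to_id[item])
        # Deduplicate and sort so each basket is a canonical array of ints.
        encoded.append(sorted(set(ids)))
    return encoded, item_to_id


baskets = [["milk", "bread"], ["bread", "eggs", "milk"], ["eggs"]]
encoded, item_to_id = encode_baskets(baskets)

# The reverse map lets the user translate mined results back to strings.
id_to_item = {i: s for s, i in item_to_id.items()}
```

Only the int arrays would need to be handed to the miner; the two dictionaries 
stay with the caller.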

So the main issue is whether we are trying to do "frequent itemset mining" or 
"association rule construction". I argue the latter is the more common 
operation, and while the latter requires the former, there's no great extra 
cost to doing both. I'm actually unfamiliar with what can be done with just the 
former and not the latter.
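
To make the relationship concrete, here is a minimal sketch in plain Python 
(not Spark). The miner is a brute-force counter standing in for Apriori, the 
rules are restricted to single-item consequents, and all function and parameter 
names are my own assumptions:

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Brute-force count of every itemset; returns {frozenset: support count}.

    Illustrative only: a real miner such as Apriori prunes candidates
    instead of enumerating every subset of every basket.
    """
    counts = {}
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_support}


def association_rules(itemsets, min_confidence):
    """Derive (antecedent, consequent, confidence) rules from frequent itemsets."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            # Every subset of a frequent itemset is itself frequent,
            # so the antecedent's support is always available here.
            confidence = support / itemsets[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, item, confidence))
    return rules


baskets = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
itemsets = frequent_itemsets(baskets, min_support=2)
rules = association_rules(itemsets, min_confidence=0.6)
```

The rule step only reads the support counts the mining step already produced, 
which is why doing both adds little cost.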

If you equate baskets with users, the connection between association rules and 
collaborative filtering becomes very clear. I want to feed in someone's, say, 
movie viewing history and get recommendations of the form "you watched X and Y, 
so you'll really like Z" (where {X, Y, Z} is a frequent itemset).

The API could be made to match. Give me all the things this user bought in the 
format I described above, and the prediction mode is "here are all of the 
things this person bought; please apply as many rules as you can". If a user 
does care about the frequent itemsets, those would additionally be stored 
inside the model.
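
A hedged sketch of what such a prediction mode could look like, again in plain 
Python. AssociationRuleModel, predict, and the rule tuples below are 
hypothetical names and made-up values for illustration, not an existing MLlib 
API or real results:

```python
class AssociationRuleModel:
    """Hypothetical model holding mined rules and, optionally, the itemsets."""

    def __init__(self, rules, frequent_itemsets=None):
        # rules: list of (antecedent frozenset, consequent item, confidence)
        self.rules = rules
        # Kept around for users who care about the itemsets themselves.
        self.frequent_itemsets = frequent_itemsets or {}

    def predict(self, basket):
        """Apply every rule whose antecedent is contained in the basket."""
        owned = set(basket)
        best = {}
        for antecedent, consequent, confidence in self.rules:
            # Skip rules that do not fire or that recommend an owned item.
            if antecedent <= owned and consequent not in owned:
                best[consequent] = max(best.get(consequent, 0.0), confidence)
        # Highest-confidence recommendations first; ties broken by item id.
        return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up rules, purely to exercise predict().
rules = [
    (frozenset({1, 2}), 3, 0.9),
    (frozenset({1}), 4, 0.5),
    (frozenset({5}), 6, 0.8),
    (frozenset({2}), 3, 0.7),
]
model = AssociationRuleModel(rules)
recommendations = model.predict([1, 2])
```

When several rules recommend the same item, this sketch keeps the highest 
confidence; other ways of combining rules are equally plausible.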

The alternative here is to make a set of frequent itemset miners and then make 
an association rule learner that takes their output. The only downside is that 
this suffers some perf loss (requiring an additional pass). I'll gladly write 
this version if we decide that's the way we should go.

> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent item set mining in a 
> transactional data set.  It will be useful if Apriori algorithm is added to 
> MLLib in Spark



