[ 
https://issues.apache.org/jira/browse/SPARK-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650415#comment-14650415
 ] 

Xiangrui Meng commented on SPARK-8999:
--------------------------------------

[~srowen] Thanks for your feedback! The PrefixSpan paper has ~2k citations,
and I can find implementations in many libraries, e.g., SPMF and R. I think it
is fair to say the algorithm is popular in data mining. The question I had is
whether we want to support sequences of itemsets instead of sequences of
items. The former complicates both the API and the implementation. I asked the
author of SPMF for advice. He said that without itemset support the problem is
called string mining, which some other algorithms should handle efficiently.
So it seems that we should implement PrefixSpan as in the paper, which
supports itemsets.

> Support non-temporal sequence in PrefixSpan
> -------------------------------------------
>
>                 Key: SPARK-8999
>                 URL: https://issues.apache.org/jira/browse/SPARK-8999
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>            Assignee: Zhang JiaJin
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> In SPARK-6487, we assume that all items are ordered. However, we should 
> support non-temporal sequences in PrefixSpan. This should be done before 1.5 
> because it changes PrefixSpan APIs.
> We can use `Array[Array[Int]]`, or follow SPMF and use `Array[Int]` with -1 
> marking itemset boundaries. The latter is more efficient for storage. If we 
> support a generic item type, we can use null as the boundary marker.
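
To make the two candidate encodings concrete, here is a minimal sketch in Scala. It is not Spark code: the object name `ItemsetEncoding`, the helpers `flatten` and `unflatten`, and the choice to write the -1 marker after every itemset (including the last) are assumptions made for illustration only.

```scala
// Illustrative sketch only; none of these names exist in Spark or SPMF.
object ItemsetEncoding {
  // Assumed boundary marker, following the "-1" convention in the description.
  val Delimiter: Int = -1

  /** Flattens a sequence of itemsets, writing Delimiter after each itemset. */
  def flatten(sequence: Array[Array[Int]]): Array[Int] =
    sequence.flatMap(itemset => itemset :+ Delimiter)

  /** Rebuilds the nested form from a flat, Delimiter-separated array. */
  def unflatten(flat: Array[Int]): Array[Array[Int]] = {
    val itemsets = scala.collection.mutable.ArrayBuffer.empty[Array[Int]]
    val current = scala.collection.mutable.ArrayBuffer.empty[Int]
    flat.foreach { x =>
      if (x == Delimiter) {
        // A delimiter closes the itemset collected so far.
        itemsets += current.toArray
        current.clear()
      } else {
        current += x
      }
    }
    // Keep a trailing itemset that was not closed by a delimiter.
    if (current.nonEmpty) itemsets += current.toArray
    itemsets.toArray
  }
}
```

Under these assumptions, the sequence <(1 2)(3)(1 2 4)> is `Array(Array(1, 2), Array(3), Array(1, 2, 4))` in the nested form and `Array(1, 2, -1, 3, -1, 1, 2, 4, -1)` in the flat form. The flat form stores one primitive array per sequence instead of one array object per itemset, which is where the storage saving comes from. The cost is that -1 can no longer be used as an item.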



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
