[GitHub] spark pull request: [SPARK-9898][MLlib] Prefix Span user guide

feynmanliang Mon, 17 Aug 2015 16:06:02 -0700

Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8253#discussion_r37247702
  
    --- Diff: docs/mllib-frequent-pattern-mining.md ---
    @@ -96,3 +96,92 @@ for (FPGrowth.FreqItemset<String> itemset: model.freqItemsets().toJavaRDD().coll
     
     </div>
     </div>
    +
    +## Prefix Span
    +
    +Prefix Span is a sequential pattern mining algorithm described in
    +[Pei et al., Mining Sequential Patterns by Pattern-Growth: The
    +PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer
    +the reader to the referenced paper for a formal definition of the
    +sequential pattern mining problem.
    +
    +MLlib's PrefixSpan implementation takes the following parameters:
    +
    +* `minSupport`: the minimum support required to be considered a frequent
    +  sequential pattern.
    +* `maxPatternLength`: the maximum length of a frequent sequential
    +  pattern. Any frequent pattern exceeding this length will not be
    +  included in the results.
    +* `maxLocalProjDBSize`: the maximum number of items allowed in a
    +  prefix-projected database before local iterative processing of the
    +  projected database begins. This parameter should be tuned with respect
    +  to the size of your executors.
    +
    +
    +
    +**Examples**
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the
    +Prefix Span algorithm.
    +Calling `PrefixSpan.run` returns a
    +[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel)
    +that stores the frequent sequences with their frequencies.
    +
    +{% highlight scala %}
    +import org.apache.spark.mllib.fpm.PrefixSpan
    +
    +val sequences = Seq(
    +  Array(Array(1, 2), Array(3)),
    +  Array(Array(1), Array(3, 2), Array(1, 2)),
    +  Array(Array(1, 2), Array(5)),
    +  Array(Array(6)))
    +val rdd = sc.parallelize(sequences, 2).cache()
    +
    +val prefixSpan = new PrefixSpan()
    +  .setMinSupport(0.5)
    +  .setMaxPatternLength(5)
    +val model = prefixSpan.run(rdd)
    +model.freqSequences.collect().foreach { freqSequence =>
    +  println(
    +    freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") +
    +      ", " + freqSequence.freq)
    +}
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +[`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
    +Prefix Span algorithm.
    +Calling `PrefixSpan.run` returns a
    +[`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
    +that stores the frequent sequences with their frequencies.
    +
    +{% highlight java %}
    +import java.util.Arrays;
    +import java.util.List;
    +
    +import org.apache.spark.mllib.fpm.PrefixSpan;
    +import org.apache.spark.mllib.fpm.PrefixSpanModel;
    +
    +JavaRDD<List<List<Integer>>> sequences = sc.parallelize(Arrays.asList(
    --- End diff --
    
    Actually, I modified the Scala example to create the RDD directly as well.
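
For readers following the `minSupport` parameter in the diff above, here is a minimal plain-Java sketch (no Spark) of how the support of a sequential pattern could be counted over the four example sequences. It is not MLlib's implementation, which works on prefix-projected databases. The class and method names (`SupportSketch`, `contains`, `support`, `minCount`) are hypothetical, and it assumes that `minSupport` is a fraction of the number of input sequences and that the resulting count threshold is rounded up.

```java
import java.util.Arrays;
import java.util.List;

public class SupportSketch {
    // The four example sequences from the user guide; each sequence is a list of itemsets.
    static final List<List<List<Integer>>> DB = Arrays.asList(
        Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)),
        Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)),
        Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)),
        Arrays.asList(Arrays.asList(6)));

    // True if every itemset of `pattern` is a subset of some itemset of `sequence`,
    // with the matched itemsets appearing in the same order (greedy earliest match).
    static boolean contains(List<List<Integer>> sequence, List<List<Integer>> pattern) {
        int next = 0; // index of the next pattern itemset still to be matched
        for (List<Integer> itemset : sequence) {
            if (next < pattern.size() && itemset.containsAll(pattern.get(next))) {
                next++;
            }
        }
        return next == pattern.size();
    }

    // Number of sequences in `db` that contain `pattern`.
    static long support(List<List<List<Integer>>> db, List<List<Integer>> pattern) {
        return db.stream().filter(s -> contains(s, pattern)).count();
    }

    // Assumption: a fractional minSupport maps to a count threshold by rounding up.
    static long minCount(double minSupport, long numSequences) {
        return (long) Math.ceil(minSupport * numSequences);
    }

    public static void main(String[] args) {
        List<List<Integer>> pattern = Arrays.asList(Arrays.asList(1), Arrays.asList(3));
        long threshold = minCount(0.5, DB.size());
        long count = support(DB, pattern);
        System.out.println(pattern + " support=" + count + " frequent=" + (count >= threshold));
    }
}
```

Under these assumptions, a pattern counts as frequent with `minSupport = 0.5` when at least two of the four sequences contain it.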



