Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192002383
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+ .. versionadded:: 2.4.0
+
+ """
+ @staticmethod
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(dataset,
+ sequenceCol,
+ minSupport,
+ maxPatternLength,
+ maxLocalProjDBSize):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
+ :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+ column are ignored.
+ :param minSupport: The minimal support level of the sequential pattern, any pattern that
+ appears more than (minSupport * size-of-the-dataset) times will be
+ output (recommended value: `0.1`).
+ :param maxPatternLength: The maximal length of the sequential pattern
+ (recommended value: `10`).
+ :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+ internal storage format) allowed in a projected database before
+ local processing. If a projected database exceeds this size,
+ another iteration of distributed prefix growth is run
+ (recommended value: `32000000`).
+ :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+ The schema of it will be:
+ - `sequence: Seq[Seq[T]]` (T is the item type)
+ - `freq: Long`
+
+ >>> from pyspark.ml.fpm import PrefixSpan
+ >>> from pyspark.sql import Row
+ >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --
We should keep doctest examples simple to read. For example, including
`maxLocalProjDBSize` is not useful because we don't expect users to tune this
param often.
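
For instance, if the tuning parameters had keyword defaults, the doctest could
be as short as this sketch (the `spark` session fixture, the `minSupport=0.5`
value, and the keyword defaults themselves are assumptions for illustration,
not the static signature currently in this diff):

>>> from pyspark.ml.fpm import PrefixSpan
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(sequence=[[1, 2], [3]]),
...                             Row(sequence=[[1], [3, 2], [1, 2]])])
>>> PrefixSpan.findFrequentSequentialPatterns(df, "sequence",
...                                           minSupport=0.5).show()

That keeps the reader's attention on the sequence data and the support
threshold rather than on internals like the projected-database size.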
---