Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21265#discussion_r192002383
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
def _create_model(self, java_model):
return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+ """
+ .. note:: Experimental
+
+ A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ Efficiently by Prefix-Projected Pattern Growth
+ (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+ .. versionadded:: 2.4.0
+
+ """
+ @staticmethod
+ @since("2.4.0")
+ def findFrequentSequentialPatterns(dataset,
+ sequenceCol,
+ minSupport,
+ maxPatternLength,
+ maxLocalProjDBSize):
+ """
+ .. note:: Experimental
+ Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+ :param dataset: A dataset or a dataframe containing a sequence column which is
+ `Seq[Seq[_]]` type.
+ :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+ column are ignored.
+ :param minSupport: The minimal support level of the sequential pattern, any pattern that
+ appears more than (minSupport * size-of-the-dataset) times will be
+ output (recommended value: `0.1`).
+ :param maxPatternLength: The maximal length of the sequential pattern
+ (recommended value: `10`).
+ :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+ internal storage format) allowed in a projected database before
+ local processing. If a projected database exceeds this size,
+ another iteration of distributed prefix growth is run
+ (recommended value: `32000000`).
+ :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+ The schema of it will be:
+ - `sequence: Seq[Seq[T]]` (T is the item type)
+ - `freq: Long`
+
+ >>> from pyspark.ml.fpm import PrefixSpan
+ >>> from pyspark.sql import Row
+ >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --
We should keep doctest examples simple to read. For example, including
`maxLocalProjDBSize` is not useful because we don't expect users to tune this
param often.
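
For instance, if the tuning parameters had keyword defaults, the doctest could
be as short as this sketch (the `spark` session fixture, the `minSupport=0.5`
value, and the keyword defaults themselves are assumptions for illustration,
not the static signature currently in this diff):

>>> from pyspark.ml.fpm import PrefixSpan
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(sequence=[[1, 2], [3]]),
...                             Row(sequence=[[1], [3, 2], [1, 2]])])
>>> PrefixSpan.findFrequentSequentialPatterns(df, "sequence",
...                                           minSupport=0.5).show()

That keeps the reader's attention on the sequence data and the support
threshold rather than on internals like the projected-database size.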
---