[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

WeichenXu123 Mon, 30 Apr 2018 18:09:49 -0700

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20973#discussion_r185149879
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
    @@ -44,26 +43,37 @@ object PrefixSpan {
        *
        * @param dataset A dataset or a dataframe containing a sequence column which is
        *                {{{Seq[Seq[_]]}}} type
    -   * @param sequenceCol the name of the sequence column in dataset
    +   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
    +   *                    are ignored
        * @param minSupport the minimal support level of the sequential pattern, any pattern that
        *                   appears more than (minSupport * size-of-the-dataset) times will be output
    -   *                  (default: `0.1`).
    -   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
    -   *                         less than maxPatternLength will be output (default: `10`).
    +   *                  (recommended value: `0.1`).
    +   * @param maxPatternLength the maximal length of the sequential pattern
    +   *                         (recommended value: `10`).
        * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
        *                           internal storage format) allowed in a projected database before
        *                           local processing. If a projected database exceeds this size, another
    -   *                           iteration of distributed prefix growth is run (default: `32000000`).
    -   * @return A dataframe that contains columns of sequence and corresponding frequency.
    +   *                           iteration of distributed prefix growth is run
    +   *                           (recommended value: `32000000`).
    +   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
    +   *         The schema of it will be:
    +   *          - `sequence: Seq[Seq[T]]` (T is the item type)
    +   *          - `frequency: Long`
    --- End diff --
    
    sure!
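
For readers of the doc comment above, here is a minimal single-machine sketch of the documented semantics: null rows are ignored, each pattern has the shape `Seq[Seq[T]]`, and its frequency is a `Long`. It is not Spark's PrefixSpan implementation. The object and method names (`PrefixSpanSemanticsSketch`, `containsPattern`, `frequency`, `isFrequent`) are hypothetical. The "more than" comparison is taken literally from the doc comment, so the exact boundary behavior in the real code may differ.

```scala
// Illustrative only: a naive check of the semantics described in the doc comment.
// NOT Spark's implementation; every name below is hypothetical.
object PrefixSpanSemanticsSketch {
  type Sequence[T] = Seq[Seq[T]]

  // True if `pattern` occurs in `sequence`: each pattern itemset must be a subset
  // of some itemset of the sequence, and the matches must appear in order.
  def containsPattern[T](sequence: Sequence[T], pattern: Sequence[T]): Boolean = {
    var from = 0 // next position in `sequence` that may still be matched
    pattern.forall { itemset =>
      val idx = sequence.indexWhere(s => itemset.forall(x => s.contains(x)), from)
      if (idx >= 0) { from = idx + 1; true } else false
    }
  }

  // Number of non-null rows containing `pattern` (null rows are ignored).
  def frequency[T](rows: Seq[Sequence[T]], pattern: Sequence[T]): Long =
    rows.count(r => r != null && containsPattern(r, pattern)).toLong

  // Assumption: threshold read literally from the doc comment, i.e.
  // frequency > minSupport * (number of non-null rows).
  def isFrequent[T](rows: Seq[Sequence[T]], pattern: Sequence[T],
                    minSupport: Double): Boolean =
    frequency(rows, pattern) > minSupport * rows.count(_ != null)
}
```

With four non-null rows and `minSupport = 0.4`, a pattern must appear in more than 1.6 rows, so one seen twice qualifies and one seen once does not.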


