Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r185149879
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -44,26 +43,37 @@ object PrefixSpan {
*
* @param dataset A dataset or a dataframe containing a sequence column which is
* {{{Seq[Seq[_]]}}} type
- * @param sequenceCol the name of the sequence column in dataset
+ * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
+ * are ignored
* @param minSupport the minimal support level of the sequential pattern, any pattern that
* appears more than (minSupport * size-of-the-dataset) times will be output
- * (default: `0.1`).
- * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
- * less than maxPatternLength will be output (default: `10`).
+ * (recommended value: `0.1`).
+ * @param maxPatternLength the maximal length of the sequential pattern
+ * (recommended value: `10`).
* @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
* internal storage format) allowed in a projected database before
* local processing. If a projected database exceeds this size, another
- * iteration of distributed prefix growth is run (default: `32000000`).
- * @return A dataframe that contains columns of sequence and corresponding frequency.
+ * iteration of distributed prefix growth is run
+ * (recommended value: `32000000`).
+ * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
+ * The schema of it will be:
+ * - `sequence: Seq[Seq[T]]` (T is the item type)
+ * - `frequency: Long`
--- End diff --
sure!
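
For reference, here is a minimal plain-Scala sketch of how the documented `minSupport` and the "rows with nulls are ignored" note might interact. It does not call Spark. The object name `MinSupportSketch`, the helper `minCount`, and the sample rows are hypothetical, and rounding the threshold up is an assumption rather than something the diff states.

```scala
object MinSupportSketch {
  // Assumption: the relative support is converted into an absolute
  // occurrence count by rounding up. The diff does not spell this out.
  def minCount(minSupport: Double, datasetSize: Long): Long =
    math.ceil(minSupport * datasetSize).toLong

  def main(args: Array[String]): Unit = {
    // Hypothetical input: each row is a sequence of itemsets, Seq[Seq[Int]].
    // None stands in for a null in the sequence column.
    val rows: Seq[Option[Seq[Seq[Int]]]] = Seq(
      Some(Seq(Seq(1, 2), Seq(3))),
      Some(Seq(Seq(1), Seq(3, 2), Seq(1, 2))),
      None,
      Some(Seq(Seq(6)))
    )

    // Drop null rows first, as the updated sequenceCol doc describes.
    val sequences = rows.flatten

    // Assumption: the dataset size used for the threshold is the size
    // after null rows are dropped.
    val threshold = minCount(0.5, sequences.size.toLong)
    println(s"rows kept: ${sequences.size}, absolute threshold: $threshold")
  }
}
```

Under these assumptions a pattern has to reach the absolute threshold among the non-null rows to be reported, so null rows do not count towards "size-of-the-dataset".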
---