[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

BryanCutler Wed, 14 Mar 2018 15:09:55 -0700

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20777#discussion_r174624206
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
       def getMinDF: Double = $(minDF)
     
       /**
    -   * Specifies the maximum number of different documents a term must appear in to be included
    -   * in the vocabulary.
    -   * If this is an integer greater than or equal to 1, this specifies the number of documents
    -   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
    -   * documents.
    +   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
    +   * of different documents a term could appear in to be included in the vocabulary.
    +   * If this is an integer greater than or equal to 1, this specifies the maximum number of
    +   * documents the term could appear in; if this is a double in [0,1), then this specifies the
    +   * maximum fraction of documents the term could appear in. A term appears more frequently
    +   * than maxDF will be removed.
    --- End diff --
    
    This sounds much better, but it should probably say "ignored" instead of "removed", and it might be good to reorder the sentences like this:
    
    ```
    Specifies the maximum number of different documents a term could appear in to be included
    in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
    integer greater than or equal to 1, this specifies the maximum number of documents the term
    could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
    documents the term could appear in.
    ```
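
To make the two readings of the parameter concrete, here is a minimal plain-Python sketch of the behavior the doc comment describes. It is not the Spark implementation: the helper names `resolve_threshold` and `filter_vocabulary` are hypothetical, and the comparison (a term is kept when its document frequency is at most the threshold) is an assumption taken from the wording "appears more than the threshold will be ignored".

```python
from collections import Counter


def resolve_threshold(max_df, num_docs):
    """Turn a maxDF-style value into an absolute document count (hypothetical helper)."""
    # Values >= 1 are read as a document count; values in [0, 1) as a
    # fraction of the total number of documents.
    return max_df if max_df >= 1.0 else max_df * num_docs


def filter_vocabulary(docs, max_df):
    """Return the sorted terms whose document frequency does not exceed the threshold."""
    # Document frequency: the number of distinct documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    threshold = resolve_threshold(max_df, len(docs))
    # Assumption: a term appearing in more documents than the threshold is ignored.
    return sorted(term for term, df in doc_freq.items() if df <= threshold)


docs = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["a"]]

vocab_by_count = filter_vocabulary(docs, 2)       # at most 2 documents
vocab_by_fraction = filter_vocabulary(docs, 0.5)  # at most 50% of 4 documents
vocab_strict = filter_vocabulary(docs, 1)         # at most 1 document
```

In this toy corpus "a" occurs in all four documents, so it is dropped under every setting shown, while "b" (two documents) survives only when the threshold resolves to 2.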


