[GitHub] spark pull request #20367: [SPARK-23166][ML] Add maxDF Parameter to CountVec...

ymazari Sat, 27 Jan 2018 08:48:06 -0800

Github user ymazari commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20367#discussion_r164275764
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -155,24 +182,48 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)
         transformSchema(dataset.schema, logging = true)
         val vocSize = $(vocabSize)
         val input = dataset.select($(inputCol)).rdd.map(_.getAs[Seq[String]](0))
    +    val countingRequired = $(minDF) < 1.0 || $(maxDF) < 1.0
    +    val maybeInputSize = if (countingRequired) {
    +      Some(input.cache().count())
    +    } else {
    +      None
    +    }
         val minDf = if ($(minDF) >= 1.0) {
           $(minDF)
         } else {
    -      $(minDF) * input.cache().count()
    +      $(minDF) * maybeInputSize.get
         }
    -    val wordCounts: RDD[(String, Long)] = input.flatMap { case (tokens) =>
    +    val maxDf = if ($(maxDF) >= 1.0) {
    +      $(maxDF)
    +    } else {
    +      $(maxDF) * maybeInputSize.get
    +    }
    +    require(maxDf >= minDf, "maxDF must be >= minDF.")
    +    val allWordCounts = input.flatMap { case (tokens) =>
           val wc = new OpenHashMap[String, Long]
           tokens.foreach { w =>
             wc.changeValue(w, 1L, _ + 1L)
           }
           wc.map { case (word, count) => (word, (count, 1)) }
         }.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
           (wc1 + wc2, df1 + df2)
    -    }.filter { case (word, (wc, df)) =>
    -      df >= minDf
    -    }.map { case (word, (count, dfCount)) =>
    -      (word, count)
    -    }.cache()
    +    }
    +
    +    val filteringRequired = isSet(minDF) || isSet(maxDF)
    --- End diff --
    
    I introduced the `filteringRequired` variable here for the sake of clarity.
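
For readers without Spark at hand, the threshold logic in the diff can be sketched on plain Scala collections. This is only an illustration under stated assumptions: `DfThresholds`, `resolve`, and `filterVocab` are hypothetical names that do not appear in the PR, and the final `df >= minDf && df <= maxDf` filter is an assumption, since the quoted diff stops before the filtering step.

```scala
// Hypothetical sketch, not the PR's code: mirrors the minDF/maxDF handling
// from the diff using in-memory collections instead of an RDD.
object DfThresholds {
  // Values >= 1.0 are absolute document counts; values < 1.0 are fractions
  // of the corpus. corpusSize is by-name, so it is only evaluated (i.e. the
  // corpus is only counted) when a fractional parameter needs it.
  def resolve(param: Double, corpusSize: => Long): Double =
    if (param >= 1.0) param else param * corpusSize

  // Returns the sorted terms whose document frequency lies in [minDf, maxDf].
  def filterVocab(docs: Seq[Seq[String]], minDF: Double, maxDF: Double): List[String] = {
    lazy val n = docs.size.toLong
    val minDf = resolve(minDF, n)
    val maxDf = resolve(maxDF, n)
    require(maxDf >= minDf, "maxDF must be >= minDF.")

    // Per document: (word, (term count, 1)); then sum both parts per word.
    val allWordCounts = docs
      .flatMap { tokens =>
        tokens.groupBy(identity).map { case (w, ws) => (w, (ws.size.toLong, 1)) }
      }
      .groupBy(_._1)
      .map { case (w, xs) =>
        (w, xs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2)))
      }

    // Assumed filter: keep terms whose document frequency is within bounds.
    allWordCounts
      .filter { case (_, (_, df)) => df >= minDf && df <= maxDf }
      .keys.toList.sorted
  }
}
```

With four documents, `minDF = 0.5` resolves to a threshold of 2 documents, while `maxDF = 3.0` is taken as an absolute count of 3 documents.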



