GitHub user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79889093
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -150,6 +150,75 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
         validateProbabilities(featureAndProbabilities, model, "multinomial")
       }
     
    +  test("Naive Bayes Multinomial with weighted samples") {
    +    val (dataset, weightedDataset) = {
    +      val nPoints = 1000
    +      val piArray = Array(0.5, 0.1, 0.4).map(math.log)
    +      val thetaArray = Array(
    +        Array(0.70, 0.10, 0.10, 0.10), // label 0
    +        Array(0.10, 0.70, 0.10, 0.10), // label 1
    +        Array(0.10, 0.10, 0.70, 0.10) // label 2
    +      ).map(_.map(math.log))
    +      val pi = Vectors.dense(piArray)
    +      val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
    +
    +      val testData = generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial")
    +
    +      // Let's over-sample the label-1 samples twice, label-2 samples triple.
    +      val data1 = testData.flatMap { case labeledPoint: LabeledPoint =>
    +        labeledPoint.label match {
    +          case 0.0 => Iterator(labeledPoint)
    +          case 1.0 => Iterator(labeledPoint, labeledPoint)
    +          case 2.0 => Iterator(labeledPoint, labeledPoint, labeledPoint)
    +        }
    +      }
    +
    +      val rnd = new Random(8392)
    +      val data2 = testData.flatMap { case LabeledPoint(label: Double, features: Vector) =>
    --- End diff --
    
    I submitted a PR to your PR with the weighted tests (hopefully I've done
    that correctly). I also think it would be nice to test a case where the
    majority of the samples are outliers but carry small weights, so they should
    not affect the predictions. This is semi-automated in MLUtils, but since
    NaiveBayes requires a certain type of features (0/1 in some cases), I don't
    think it integrates nicely yet. I think we should create a JIRA to automate
    weighted testing, where we can think about this all together. For now, this
    test should be sufficient.
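
    To illustrate the two properties a weighted test relies on, here is a
    minimal, Spark-free sketch. It assumes weighted multinomial Naive Bayes
    estimates with additive (Laplace) smoothing; `WeightedNaiveBayesSketch`,
    `Sample`, `fit`, and `maxAbsDiff` are hypothetical names made up for this
    example, not the code in the PR or any Spark API. It checks that
    (1) duplicating a sample k times gives the same estimates as weighting it
    by k, and (2) many outliers with tiny weights barely move the estimates.

```scala
object WeightedNaiveBayesSketch {
  // Hypothetical sample type for this sketch; not Spark's LabeledPoint.
  final case class Sample(label: Int, features: Array[Double], weight: Double)

  // Returns (log priors, log conditional probabilities) using
  // weighted counts and additive smoothing with parameter lambda.
  def fit(
      samples: Seq[Sample],
      numLabels: Int,
      numFeatures: Int,
      lambda: Double = 1.0): (Array[Double], Array[Array[Double]]) = {
    val labelWeight = Array.fill(numLabels)(0.0)
    val featureSums = Array.fill(numLabels, numFeatures)(0.0)
    for (s <- samples) {
      labelWeight(s.label) += s.weight
      for (j <- 0 until numFeatures) {
        featureSums(s.label)(j) += s.weight * s.features(j)
      }
    }
    val total = labelWeight.sum
    val pi = labelWeight.map(w => math.log((w + lambda) / (total + numLabels * lambda)))
    val theta = featureSums.map { row =>
      val denom = row.sum + numFeatures * lambda
      row.map(c => math.log((c + lambda) / denom))
    }
    (pi, theta)
  }

  // Largest element-wise absolute difference between two arrays.
  def maxAbsDiff(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.max

  def main(args: Array[String]): Unit = {
    val base = Seq(
      Sample(0, Array(3.0, 0.0, 1.0), 1.0),
      Sample(1, Array(0.0, 2.0, 1.0), 1.0),
      Sample(2, Array(1.0, 1.0, 4.0), 1.0))

    // Over-sample label 1 twice and label 2 three times, as in the test above.
    val overSampled = base.flatMap(s => Seq.fill(s.label + 1)(s))
    // Equivalent weighted data: one copy of each sample, weight = copy count.
    val weighted = base.map(s => s.copy(weight = s.label + 1.0))

    val (pi1, theta1) = fit(overSampled, 3, 3)
    val (pi2, theta2) = fit(weighted, 3, 3)
    assert(maxAbsDiff(pi1, pi2) < 1e-12)
    assert(maxAbsDiff(theta1.flatten, theta2.flatten) < 1e-12)

    // Majority of samples are outliers, but each carries a tiny weight.
    val outliers = Seq.fill(100)(Sample(0, Array(0.0, 5.0, 0.0), 1e-6))
    val (pi3, theta3) = fit(weighted ++ outliers, 3, 3)
    assert(maxAbsDiff(pi2, pi3) < 1e-3)
    assert(maxAbsDiff(theta2.flatten, theta3.flatten) < 1e-3)
  }
}
```

    The real suite would make the same comparisons on fitted `NaiveBayes`
    models rather than on hand-rolled estimates; the tolerances here are
    arbitrary choices for the sketch.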

