[GitHub] spark pull request: [SPARK-7425] [ML] spark.ml Predictor should su...

sethah Tue, 29 Mar 2016 10:56:10 -0700

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10355#discussion_r57768696
  
    --- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -17,14 +17,116 @@
     
     package org.apache.spark.ml.util
     
    -import org.apache.spark.ml.Model
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.{Estimator, Model, PredictionModel, Predictor}
     import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.regression.Regressor
    +import org.apache.spark.ml.tree.impl.TreeTests
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.sql.{DataFrame, SQLContext}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
     
    -object MLTestingUtils {
    +object MLTestingUtils extends SparkFunSuite {
       def checkCopy(model: Model[_]): Unit = {
         val copied = model.copy(ParamMap.empty)
           .asInstanceOf[Model[_]]
         assert(copied.parent.uid == model.parent.uid)
         assert(copied.parent == model.parent)
       }
    +
    +  def checkPredictorAcceptAllNumericTypes[M <: PredictionModel[_, M], T <: 
Predictor[_, _, M]](
    --- End diff --
    
    Is there a reason to have two separate methods for predictor and estimator? 
I also tend to prefer _not_ separating the "accept" and "reject" checks, and 
instead have them in a single function call. This will reduce the complexity of 
having to decide which one to use (and why) and also reduce the number of 
function calls in future tests. I was able to get the tests working having a 
single method for accept/reject and estimator/predictor, like this:
    
    ```scala
      def checkEstimatorNumericTypes[M <: Model[M], T <: Estimator[M]](
          predictor: T,
          isClassification: Boolean,
          sqlContext: SQLContext)(check: (M, M) => Unit): Unit = {
        val dfs = if (isClassification) {
          genClassifDFWithNumericLabelCol(sqlContext)
        } else {
          genRegressionDFWithNumericLabelCol(sqlContext)
        }
        val expected = predictor.fit(dfs(DoubleType))
        val actuals = dfs.keys.filter(_ != DoubleType).map(t => 
predictor.fit(dfs(t)))
        actuals.foreach(actual => check(expected, actual))
    
        val dfWithStringLabels = generateDFWithStringLabelCol(sqlContext)
        val thrown = intercept[IllegalArgumentException] {
          predictor.fit(dfWithStringLabels)
        }
        assert(thrown.getMessage contains
          "Column label must be of type NumericType but was actually of type 
StringType")
      }
    ```
    
    I had to change the fit method of OneVsRest to include a call to 
`transformSchema(dataset.schema)` so that it would fail properly on non-numeric 
label column. I think it's better to change the fit method so that all ML 
estimators provide consistent errors. Also note that the `isClassification` 
flag is needed because MLPClassifier does not inherit from Classifier, and so 
it is not easy to infer that it is a classification estimator.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-7425] [ML] spark.ml Predictor should su...

Reply via email to