[GitHub] [spark] ancasarb opened a new pull request #24509: Linear Regression - validate training related params such as loss only during fitting phase

GitBox Wed, 01 May 2019 14:00:56 -0700

ancasarb opened a new pull request #24509: Linear Regression - validate 
training related params such as loss only during fitting phase
URL: https://github.com/apache/spark/pull/24509
 
 
   ## What changes were proposed in this pull request?
   
   When the transform(...) method is called on a LinearRegressionModel created 
directly from the coefficients and intercept, the following exception is 
thrown:
   
   ```
   java.util.NoSuchElementException: Failed to find a default value for loss
        at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
        at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779)
        at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
        at org.apache.spark.ml.param.Params$class.$(params.scala:786)
        at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
        at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
        at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
        at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192)
        at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
        at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
        at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
        at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
   ```
   
   This happens because validateAndTransformSchema() is called during both the 
training and scoring phases, but checks against training-related params such 
as loss should be performed during the training phase only. Please correct me 
if I'm missing anything :)
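
   To illustrate the idea, here is a minimal, self-contained sketch of gating 
training-only checks behind a fitting flag. It does not use the real Spark ML 
classes: RegressionParams, its fields, and this simplified 
validateAndTransformSchema signature are hypothetical stand-ins, and the 
huber/normal-solver check is only an example of a training-related rule, not 
the actual patch.

```scala
// Simplified stand-in for Spark ML params; NOT the real org.apache.spark.ml API.
object FittingOnlyValidation {
  // Hypothetical holder: `loss` is None when a model was rebuilt from its
  // coefficients and intercept alone, so no training params were restored.
  final case class RegressionParams(loss: Option[String], solver: String = "auto")

  // Training-only checks run only when `fitting` is true. The scoring path
  // skips them, so a missing `loss` is never read during transform.
  def validateAndTransformSchema(
      params: RegressionParams,
      schema: Seq[String],
      fitting: Boolean): Seq[String] = {
    if (fitting) {
      val loss = params.loss.getOrElse(
        throw new NoSuchElementException("Failed to find a default value for loss"))
      if (loss == "huber") {
        // Example training-related rule (assumed for illustration).
        require(params.solver != "normal", "huber loss does not support the normal solver")
      }
    }
    // Checks that matter for both fitting and scoring stay unconditional.
    require(schema.contains("features"), "schema must contain a features column")
    schema :+ "prediction"
  }
}
```

   With this shape, a model that carries only scoring params passes schema 
validation when fitting is false, while the same call with fitting set to true 
still fails fast on the missing loss.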
   
   This issue was first reported for mleap 
(https://github.com/combust/mleap/issues/455). When we serialize Spark 
transformers for mleap, we serialize only the params that are relevant for 
scoring. We can de-serialize those transformers back into Spark for scoring, 
but in that case the training params are no longer available.
   
   ## How was this patch tested?
   Added a unit test to check this scenario.


