[
https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhengruifeng updated SPARK-38584:
---------------------------------
Description:
1. Input vector validation is missing in most algorithms. When the input
dataset contains invalid values (NaN/Infinity):
* the training may run successfully and return a model containing invalid
coefficients, like LinearSVC
* the training may fail with an irrelevant error message, like KMeans
{code:java}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._
val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()
val svc = new LinearSVC()
val model = svc.fit(df)
scala> model.intercept
res0: Double = NaN
scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be
greater or equal to 0.0, found norm1=NaN, norm2=Infinity
at scala.Predef$.require(Predef.scala:281)
at
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
{code}
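For context on the KMeans failure above: {{fastSquaredDistance}} requires non-negative norms, but any NaN feature makes the vector norm NaN and any infinite feature makes it Infinity, which trips the {{require}}. A minimal plain-Scala sketch of the propagation (the object and method names here are illustrative, not Spark's actual implementation):

```scala
object NormSketch {
  // L2 norm: a single NaN entry makes the result NaN,
  // a single infinite entry makes it Infinity.
  def norm2(values: Array[Double]): Double =
    math.sqrt(values.map(x => x * x).sum)
}
```

With the example vectors above, {{norm2(Array(1.0, Double.NaN))}} is NaN and {{norm2(Array(Double.PositiveInfinity, 2.0))}} is Infinity, matching the `norm1=NaN, norm2=Infinity` in the error message.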
2. Related methods to validate the input dataset (e.g. labels/weights) exist in
different files:
{{org.apache.spark.ml.functions}}, {{org.apache.spark.ml.util.DatasetUtils}},
{{org.apache.spark.ml.util.MetadataUtils}},
{{org.apache.spark.ml.Predictor}}, etc.
I think it is time to unify these related methods into one source file.
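One possible shape for a unified check (a sketch only; the names below are hypothetical, not existing Spark API): a single predicate rejecting NaN/Infinity entries, which each algorithm could apply to its features column (e.g. via a UDF) before training, failing fast with a clear message instead of NaN coefficients or a cryptic norm error:

```scala
object VectorValidationSketch {
  // Hypothetical unified check: every entry must be finite (no NaN/Infinity).
  def allFinite(values: Array[Double]): Boolean =
    values.forall(x => !x.isNaN && !x.isInfinity)

  // Fail fast with a descriptive message naming the offending vector.
  def validate(values: Array[Double]): Array[Double] = {
    require(allFinite(values),
      s"Input vector contains NaN/Infinity: ${values.mkString("[", ",", "]")}")
    values
  }
}
```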
> Unify the data validation
> -------------------------
>
> Key: SPARK-38584
> URL: https://issues.apache.org/jira/browse/SPARK-38584
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.4.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]