[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution

Yanbo Liang (JIRA) Fri, 19 May 2017 03:38:18 -0700

     [ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yanbo Liang updated SPARK-20810:
--------------------------------
    Description: 
When fitting an SVM classification model on the same dataset, ML {{LinearSVC}}
produces a different solution from MLlib {{SVMWithSGD}}. I understand that
they use different optimization solvers (OWLQN vs. SGD), but does it make
sense for them to converge to different solutions?
AFAIK, both of them use hinge loss, which is convex but not differentiable
everywhere. Since the derivative of the hinge loss is not uniquely defined at
the kink, should we switch to squared hinge loss, which is the default loss
function of {{sklearn.svm.LinearSVC}}?
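To make the differentiability point concrete, here is a minimal sketch of the two losses as functions of the margin y * (w . x + b), assuming labels in {-1, +1}. The object and method names are hypothetical and are not part of Spark; this only illustrates the math and is not how either solver is implemented.

```scala
object HingeLossSketch {
  // margin = y * (w . x + b), with y assumed to be -1 or +1.
  def hinge(margin: Double): Double = math.max(0.0, 1.0 - margin)

  def squaredHinge(margin: Double): Double = {
    val h = hinge(margin)
    h * h
  }

  // One valid subgradient of the hinge loss with respect to the margin.
  // At margin == 1 any value in [-1, 0] is valid; this sketch picks 0,
  // but another implementation could legitimately pick a different one.
  def hingeSubgradient(margin: Double): Double =
    if (margin < 1.0) -1.0 else 0.0

  // Derivative of the squared hinge loss with respect to the margin.
  // It is continuous everywhere and equals 0 at margin == 1.
  def squaredHingeGradient(margin: Double): Double =
    if (margin < 1.0) -2.0 * (1.0 - margin) else 0.0

  def main(args: Array[String]): Unit = {
    for (m <- Seq(0.0, 0.5, 1.0, 2.0)) {
      println(f"margin=$m%.1f hinge=${hinge(m)}%.2f " +
        f"squaredHinge=${squaredHinge(m)}%.2f " +
        f"hingeSubgrad=${hingeSubgradient(m)}%.1f " +
        f"squaredHingeGrad=${squaredHingeGradient(m)}%.1f")
    }
  }
}
```

The hinge subgradient jumps at margin 1, whereas the squared hinge gradient passes through that point continuously.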
This issue is very easy to reproduce: paste the following code snippet into
{{LinearSVCSuite}} and run it in IntelliJ IDEA.
{code}
test("LinearSVC vs SVMWithSGD") {
  // SVMWithSGD is needed here unless the suite already imports it.
  import org.apache.spark.mllib.classification.SVMWithSGD
  import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
  import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

  // ML LinearSVC (OWLQN solver).
  val trainer1 = new LinearSVC()
    .setRegParam(0.00002)
    .setMaxIter(200)
    .setTol(1e-4)
  val model1 = trainer1.fit(binaryDataset)

  println(model1.coefficients)
  println(model1.intercept)

  // Convert the ML dataset into the RDD of old LabeledPoints that MLlib expects.
  val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) =>
    OldLabeledPoint(label, OldVectors.fromML(features))
  }

  // MLlib SVMWithSGD with the same regularization, iteration count, and tolerance.
  val trainer2 = new SVMWithSGD().setIntercept(true)
  trainer2.optimizer
    .setRegParam(0.00002)
    .setNumIterations(200)
    .setConvergenceTol(1e-4)
  val model2 = trainer2.run(oldData)

  println(model2.weights)
  println(model2.intercept)
}
{code} 

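The output is (the {{LinearSVC}} coefficients and intercept first, then the
{{SVMWithSGD}} weights and intercept):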
{code}
[7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084]
7.373454363024084
[0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165]
0.667790514894194
{code}
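
A linear decision boundary is unchanged when the coefficients and intercept are multiplied by the same positive factor, so comparing raw values alone can overstate how far apart two models are. The sketch below is one hedged way to look at the numbers above: it computes the cosine similarity and the norm ratio of the two full parameter vectors (coefficients followed by intercept). The object and method names are hypothetical; the values are copied from the output.

```scala
object CompareSolutions {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Cosine of the angle between two parameter vectors;
  // a value of 1.0 would mean they point in exactly the same direction.
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))

  // Coefficients followed by the intercept, copied from the output above.
  val linearSvc: Array[Double] = Array(
    7.24661385022775, 14.774484832179743, 22.00945617480461,
    29.558498069476084, 7.373454363024084)
  val svmWithSgd: Array[Double] = Array(
    0.58166680313823, 1.1938960150473041, 1.7940106824589588,
    2.4884300611292165, 0.667790514894194)

  def main(args: Array[String]): Unit = {
    println(s"cosine similarity = ${cosine(linearSvc, svmWithSgd)}")
    println(s"norm ratio = ${norm(linearSvc) / norm(svmWithSgd)}")
  }
}
```

A cosine close to 1 together with a large norm ratio would suggest the two solvers found similar separating directions at different scales. A cosine well below 1 would indicate genuinely different boundaries.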

  was:
Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} 
produces different solution compared with MLlib {{SVMWithSGD}}. I understand 
they use different optimization solver (OWLQN vs SGD), does it make sense to 
converge to different solution?
AFAIK, both of them use Hinge loss which is convex but not differentiable 
function. Since the derivative of the hinge loss at certain place is 
non-deterministic, should we switch to use squared hinge loss which is the 
default loss function of {{sklearn.svm.LinearSVC}}?
This issue is very easy to reproduce, you can paste the following code snippet 
to {{LinearSVCSuite}} and then click run in Intellij IDE.
{code}
test("LinearSVC vs SVMWithSGD") {
    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}

    val trainer1 = new LinearSVC()
      .setRegParam(0.00002)
      .setMaxIter(200)
      .setTol(1e-4)
    val model1 = trainer1.fit(binaryDataset)

    println(model1.coefficients)
    println(model1.intercept)

    val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) =>
        OldLabeledPoint(label, OldVectors.fromML(features))
    }
    val trainer2 = new SVMWithSGD().setIntercept(true)
    trainer2.optimizer.setRegParam(0.00002).setNumIterations(200).setConvergenceTol(1e-4)

    val model2 = trainer2.run(oldData)

    println(model2.weights)
    println(model2.intercept)
  }
{code} 


> ML LinearSVC vs MLlib SVMWithSGD output different solution
> ----------------------------------------------------------
>
>                 Key: SPARK-20810
>                 URL: https://issues.apache.org/jira/browse/SPARK-20810
>             Project: Spark
>          Issue Type: Question
>          Components: ML, MLlib
>    Affects Versions: 2.2.0
>            Reporter: Yanbo Liang



