zhengruifeng commented on a change in pull request #32124:
URL: https://github.com/apache/spark/pull/32124#discussion_r614494214



##########
File path: python/pyspark/ml/classification.py
##########
@@ -571,9 +571,9 @@ class LinearSVC(_JavaClassifier, _LinearSVCParams, JavaMLWritable, JavaMLReadabl
     >>> model.getMaxBlockSizeInMB()
     0.0
     >>> model.coefficients
-    DenseVector([0.0, -0.2792, -0.1833])
+    DenseVector([0.0, -1.0319, -0.5159])

Review comment:
   Firstly, this dataset only contains two instances, so I think the result may not be reliable:
   
   R:
   ```r
   > library(e1071)
   > label <- factor((c(1.0, 0.0)))
   > features <- as.matrix(data.frame(c(1.0, 1.0), c(1.0, 2.0), c(1.0, 3.0)))
   > C <- 2.0 / 2 / 0.01
   > model <- svm(features, label, type='C', kernel='linear', cost=C, scale=F, tolerance=1e-4)
   > w <- -t(model$coefs) %*% model$SV
   > w
        c.1..1. c.1..2. c.1..3.
   [1,]       0     0.4     0.8
   > model$rho
   [1] -2.2
   > predict(model, features)
   1 2 
   1 0 
   Levels: 0 1
   ```
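
   A quick plain-Python sanity check of the e1071 solution above (hedged: e1071's decision-function sign depends on factor-level ordering, so the orientation of the offset below is an assumption; the point is only that both training points land exactly on the margin):

```python
# Cross-check of the R/e1071 solution reported above, in plain Python.
# The offset orientation is an assumption (e1071's sign convention depends
# on factor-level ordering); what matters is that both points sit on the
# margin with decision values of exactly -1 and +1.
w = [0.0, 0.4, 0.8]   # weights from -t(model$coefs) %*% model$SV
b = -2.2              # model$rho (= -2.2), under this orientation
x1 = [1.0, 1.0, 1.0]  # label 1.0
x2 = [1.0, 2.0, 3.0]  # label 0.0

def decision(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

print(decision(x1), decision(x2))  # -1.0 1.0
```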
   
   
   master:
   ```
   >>> df = sc.parallelize([Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
   >>> svm = LinearSVC()
   >>> svm.setRegParam(0.01)
   LinearSVC_c0eb3d7e4ecb
   >>> model = svm.fit(df)
   21/04/16 09:17:11 ERROR OWLQN: Failure! Resetting history: breeze.optimize.NaNHistory:
   >>> model.summary().objectiveHistory
   [1.0, 0.8483595765125, 0.7130226960596805, 0.6959063640335401, 
0.6931614154979393, 0.677666741843437, 0.5879414534831688, 0.0930620718342175, 
0.09048677587921486, 0.07102532249029628, 0.04374349680010344, 
0.023954120035139716, 0.023830521501263628, 0.02320692346265251, 
0.0215105944133274, 0.010552738391597573, 0.01051515768008, 
0.009931999904022892, 0.008750399708406655, 0.0069553439215890535, 
0.006868546522886791, 0.006515326943672154, 0.006347452798458171, 
0.006187248462236068, 0.006164709258006091, 0.006053452700831725, 
0.0060260493716176936, 0.005887552134192266, 0.005887136407436091, 
0.005880225305895262, 0.005859805832943488, 0.005843641492521756, 
0.005843641492521756, 0.00584486187964547, 0.005843267361152545, 
0.005843241969094577, 0.005843234493660629, 0.0058432258624429665, 
0.005843208081880606, 0.005843193072290253, 0.005843171629631225, 
0.005843141001368335, 0.005843114614745189, 0.005843059249359493, 
0.005843017763481445, 0.005842944768449553, 0.005842902553628102, 0.005842759959850206, 0.0058426880723804935, 0.0058424738985869045, 
0.005842333714732286, 0.005842029622612579, 0.005841764628607474, 
0.005841335619271196, 0.005840857505437478, 0.005840248258184164, 
0.0058394259571752215, 0.005838527516717635, 0.00583821354404327, 
0.005837212690021832, 0.005837208597515555, 0.005836810329205446, 
0.005836776694348312, 0.00583640540121474, 0.00583635673616978, 
0.005836167737106047, 0.005836194033268012, 0.00583616729576746, 
0.0058361672269734615, 0.0058392815360454285, 0.005837679815125598, 
0.005836401523490996, 0.005837908002584265, 0.005836192352808301, 
0.005836670387508997, 0.005836202175777219, 0.005836200250798619, 
0.005838876113322196]
   >>> model.summary().totalIterations
   77
   >>> model.transform(df).show()
   +-----+-------------+--------------------+----------+
   |label|     features|       rawPrediction|prediction|
   +-----+-------------+--------------------+----------+
   |  1.0|[1.0,1.0,1.0]|[-1.0000039068924...|       1.0|
   |  0.0|[1.0,2.0,3.0]|[0.99999460575938...|       0.0|
   +-----+-------------+--------------------+----------+
   ```
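
   One way to see the instability in the master run is to list the iterations at which the objective *increases*; a hypothetical helper (not part of the Spark API) applied to a toy history shaped like the tail above:

```python
# Hypothetical helper: return the iteration indices at which an objective
# history goes up instead of down, to spot an oscillating tail.
def non_monotone_steps(history):
    return [i for i in range(1, len(history)) if history[i] > history[i - 1]]

# Toy history shaped like the master run above: mostly decreasing, then a bump.
hist = [1.0, 0.848, 0.713, 0.0930, 0.00584, 0.00585, 0.00584]
print(non_monotone_steps(hist))  # [5]
```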
   
   
   this PR:
   ```
   >>> from pyspark.sql import Row
   >>> from pyspark.ml.linalg import Vectors
   >>> from pyspark.ml.classification import LinearSVC
   >>> 
   >>> df = sc.parallelize([Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
   >>> svm = LinearSVC()
   >>> svm.setRegParam(0.01)
   LinearSVC_09863ffadcb7
   >>> model = svm.fit(df)
   >>> model.summary().objectiveHistory
   [1.0, 0.02084985122275127, 0.013524781160976402, 0.007607689403049228, 
0.005824637199209566, 0.005323658700671529, 0.01838013981356318, 
0.005089779695828527, 0.009036497849295026, 0.005023675663196612, 
0.006332321775223654, 0.005004910245203961, 0.005002395070241166, 
0.005000981376776186, 0.005042760733060569, 0.005000267637684085, 
0.005013305173582782, 0.005000064609216319, 0.005004925699118321, 
0.0050000068554171585, 0.005000088146444176, 0.005000002943794836, 
0.005000000740400182, 0.005001814975687912, 0.005000001493803021, 
0.0050001206863792644, 0.005000000123809999, 0.005000005933099882, 
0.005000000031021147, 0.005000002103382519, 0.005000000004626093, 
0.005000000001086049, 0.005001888167096368, 0.005000000754488572, 
0.005000000108570436, 0.005001877526710762, 0.005000000861973192, 
0.005000000216054848, 0.005001866885777133]
   >>> model.summary().totalIterations
   38
   >>> model.transform(df).show()
   +-----+-------------+--------------------+----------+
   |label|     features|       rawPrediction|prediction|
   +-----+-------------+--------------------+----------+
   |  1.0|[1.0,1.0,1.0]|[-0.9999981142568...|       1.0|
   |  0.0|[1.0,2.0,3.0]|[0.99999811425680...|       0.0|
   +-----+-------------+--------------------+----------+
   ```
   
   unfortunately, Spark's solution does not match R's, but between the two runs, this PR converges faster than the existing implementation (which also hits an `OWLQN: Failure` warning)
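
   For reference, the objective values in the histories above can be sanity-checked against the usual LinearSVC formulation, `regParam/2 * ||w||^2 + mean(hinge)` (hedged: assuming this is the exact scaling Spark uses). A minimal sketch with a made-up exactly-separating solution, not the fitted coefficients:

```python
# Sketch of the hinge-loss objective, assuming the standard formulation
# regParam/2 * ||w||^2 + mean(hinge). The weights below are a made-up
# exactly-separating solution for the two-instance dataset above.
reg_param = 0.01
data = [([1.0, 1.0, 1.0], 1.0), ([1.0, 2.0, 3.0], 0.0)]

def objective(w, b):
    l2 = 0.5 * reg_param * sum(wi * wi for wi in w)
    hinge = 0.0
    for x, label in data:
        y = 1.0 if label == 1.0 else -1.0  # map {0, 1} labels to {-1, +1}
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        hinge += max(0.0, 1.0 - margin)
    return l2 + hinge / len(data)

# w, b chosen so both margins are exactly 1, so the hinge term vanishes:
print(objective([0.0, -2.0, 0.0], 3.0))  # 0.02 = 0.5 * 0.01 * ||w||^2

# At the zero solution the objective is the mean hinge at margin 0, i.e. 1.0,
# which matches the first entry of both objectiveHistory lists above:
print(objective([0.0, 0.0, 0.0], 0.0))  # 1.0
```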
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
