zhengruifeng commented on a change in pull request #32124:
URL: https://github.com/apache/spark/pull/32124#discussion_r614494214
##########
File path: python/pyspark/ml/classification.py
##########
@@ -571,9 +571,9 @@ class LinearSVC(_JavaClassifier, _LinearSVCParams,
JavaMLWritable, JavaMLReadabl
>>> model.getMaxBlockSizeInMB()
0.0
>>> model.coefficients
- DenseVector([0.0, -0.2792, -0.1833])
+ DenseVector([0.0, -1.0319, -0.5159])
Review comment:
Firstly, this dataset only contains two instances, so I think the result may
not be reliable:
R:
```r
> library(e1071)
> label <- factor((c(1.0, 0.0)))
> features <- as.matrix(data.frame(c(1.0, 1.0), c(1.0, 2.0), c(1.0, 3.0)))
> C <- 2.0 / 2 / 0.01
> model <- svm(features, label, type='C', kernel='linear', cost=C, scale=F, tolerance=1e-4)
> w <- -t(model$coefs) %*% model$SV
> w
c.1..1. c.1..2. c.1..3.
[1,] 0 0.4 0.8
> model$rho
[1] -2.2
> predict(model, features)
1 2
1 0
Levels: 0 1
```
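As a sanity check on the R numbers (my own aside, not part of the PR): with only two separable points, the hard-margin SVM dual has a closed form, so the weights can be verified by hand in plain NumPy. Both points are support vectors with equal dual weight `alpha = 2 / ||x1 - x2||^2` (valid as long as `alpha <= cost`), giving `w = alpha * (x1 - x2)` and `b` chosen so the positive point sits exactly on the margin:

```python
import numpy as np

x1 = np.array([1.0, 1.0, 1.0])  # label 1.0
x2 = np.array([1.0, 2.0, 3.0])  # label 0.0
C = 2.0 / 2 / 0.01              # same cost as in the R snippet (= 100)

d = x1 - x2
alpha = 2.0 / d.dot(d)          # 0.4, well below C, so the box constraint is inactive
w = alpha * d                   # [0.0, -0.4, -0.8]
b = 1.0 - w.dot(x1)             # 2.2, so w.x1 + b = +1 and w.x2 + b = -1
```

This matches e1071 up to sign: R reports `w = (0, 0.4, 0.8)` and `rho = -2.2` because its factor ordering treats the other class as positive, but it is the same hyperplane.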
master:
```
>>> df = sc.parallelize([Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>> svm = LinearSVC()
>>> svm.setRegParam(0.01)
LinearSVC_c0eb3d7e4ecb
>>> model = svm.fit(df)
21/04/16 09:17:11 ERROR OWLQN: Failure! Resetting history:
breeze.optimize.NaNHistory:
>>> model.summary().objectiveHistory
[1.0, 0.8483595765125, 0.7130226960596805, 0.6959063640335401,
0.6931614154979393, 0.677666741843437, 0.5879414534831688, 0.0930620718342175,
0.09048677587921486, 0.07102532249029628, 0.04374349680010344,
0.023954120035139716, 0.023830521501263628, 0.02320692346265251,
0.0215105944133274, 0.010552738391597573, 0.01051515768008,
0.009931999904022892, 0.008750399708406655, 0.0069553439215890535,
0.006868546522886791, 0.006515326943672154, 0.006347452798458171,
0.006187248462236068, 0.006164709258006091, 0.006053452700831725,
0.0060260493716176936, 0.005887552134192266, 0.005887136407436091,
0.005880225305895262, 0.005859805832943488, 0.005843641492521756,
0.005843641492521756, 0.00584486187964547, 0.005843267361152545,
0.005843241969094577, 0.005843234493660629, 0.0058432258624429665,
0.005843208081880606, 0.005843193072290253, 0.005843171629631225,
0.005843141001368335, 0.005843114614745189, 0.005843059249359493,
0.005843017763481445, 0.005842944768449553, 0.005842902553628102,
0.005842759959850206, 0.0058426880723804935, 0.0058424738985869045,
0.005842333714732286, 0.005842029622612579, 0.005841764628607474,
0.005841335619271196, 0.005840857505437478, 0.005840248258184164,
0.0058394259571752215, 0.005838527516717635, 0.00583821354404327,
0.005837212690021832, 0.005837208597515555, 0.005836810329205446,
0.005836776694348312, 0.00583640540121474, 0.00583635673616978,
0.005836167737106047, 0.005836194033268012, 0.00583616729576746,
0.0058361672269734615, 0.0058392815360454285, 0.005837679815125598,
0.005836401523490996, 0.005837908002584265, 0.005836192352808301,
0.005836670387508997, 0.005836202175777219, 0.005836200250798619,
0.005838876113322196]
>>> model.summary().totalIterations
77
>>> model.transform(df).show()
+-----+-------------+--------------------+----------+
|label| features| rawPrediction|prediction|
+-----+-------------+--------------------+----------+
| 1.0|[1.0,1.0,1.0]|[-1.0000039068924...| 1.0|
| 0.0|[1.0,2.0,3.0]|[0.99999460575938...| 0.0|
+-----+-------------+--------------------+----------+
```
this PR:
```
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.classification import LinearSVC
>>>
>>> df = sc.parallelize([Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>> svm = LinearSVC()
>>> svm.setRegParam(0.01)
LinearSVC_09863ffadcb7
>>> model = svm.fit(df)
>>> model.summary().objectiveHistory
[1.0, 0.02084985122275127, 0.013524781160976402, 0.007607689403049228,
0.005824637199209566, 0.005323658700671529, 0.01838013981356318,
0.005089779695828527, 0.009036497849295026, 0.005023675663196612,
0.006332321775223654, 0.005004910245203961, 0.005002395070241166,
0.005000981376776186, 0.005042760733060569, 0.005000267637684085,
0.005013305173582782, 0.005000064609216319, 0.005004925699118321,
0.0050000068554171585, 0.005000088146444176, 0.005000002943794836,
0.005000000740400182, 0.005001814975687912, 0.005000001493803021,
0.0050001206863792644, 0.005000000123809999, 0.005000005933099882,
0.005000000031021147, 0.005000002103382519, 0.005000000004626093,
0.005000000001086049, 0.005001888167096368, 0.005000000754488572,
0.005000000108570436, 0.005001877526710762, 0.005000000861973192,
0.005000000216054848, 0.005001866885777133]
>>> model.summary().totalIterations
38
>>> model.transform(df).show()
+-----+-------------+--------------------+----------+
|label| features| rawPrediction|prediction|
+-----+-------------+--------------------+----------+
| 1.0|[1.0,1.0,1.0]|[-0.9999981142568...| 1.0|
| 0.0|[1.0,2.0,3.0]|[0.99999811425680...| 0.0|
+-----+-------------+--------------------+----------+
```
Unfortunately, Spark's solutions do not match R's, but between the two, this
PR converges faster than the existing one (which logs an `OWLQN: Failure`
error).
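For what it's worth, candidate solutions like these can also be compared on the primal objective directly. Below is a minimal sketch (my own, not Spark's code) using one common convention, mean hinge loss plus `regParam/2 * ||coef||^2`; Spark's exact scaling and intercept handling may differ, so treat the absolute numbers as indicative only:

```python
import numpy as np

def hinge_objective(coef, intercept, X, y01, reg_param=0.01):
    # J(w, b) = mean_i max(0, 1 - y_i (w.x_i + b)) + reg_param/2 * ||w||^2,
    # with labels mapped from {0, 1} to {-1, +1}; intercept unregularized.
    y = 2.0 * y01 - 1.0
    margins = y * (X @ coef + intercept)
    hinge = np.maximum(0.0, 1.0 - margins)
    return hinge.mean() + 0.5 * reg_param * (coef @ coef)

X = np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]])
y01 = np.array([1.0, 0.0])

# libsvm-style solution recovered from the R snippet above (sign-corrected):
obj = hinge_objective(np.array([0.0, -0.4, -0.8]), 2.2, X, y01)  # ~0.004
```

Both points sit exactly on the margin for this solution, so the hinge term vanishes and only the regularization term remains.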