zhengruifeng commented on issue #26735: [SPARK-30102][WIP][ML][PYSPARK] GMM 
supports instance weighting
URL: https://github.com/apache/spark/pull/26735#issuecomment-563044225
 
 
   It took me a few days to get to the bottom of the test failure.
   
   1. In 2.4.4, I cannot reproduce these doctest results:
   ```python
       >>> summary.clusterSizes
       [2, 2, 2]
       >>> summary.logLikelihood
       8.14636...
   ```
   until I explicitly set the partition count to 2, like this: `df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])`.
   That is because the existing `df = spark.createDataFrame(data, ["features"])` creates a DataFrame with 12 partitions, and GMM is highly sensitive to the initialization.
   It is also odd to me that `spark.createDataFrame` creates a DataFrame with 6 partitions on the Scala side.
   My laptop has an 8850 CPU with 6 cores and 12 threads.
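   
   A minimal sketch of the difference, assuming a local SparkSession and toy two-dimensional points shaped like the doctest's input (not the exact doctest data):
   ```python
   from pyspark.sql import SparkSession
   from pyspark.ml.linalg import Vectors
   
   spark = SparkSession.builder.getOrCreate()
   sc = spark.sparkContext
   
   # Illustrative stand-in for the doctest's "features" column.
   data = [(Vectors.dense([-0.1, -0.05]),), (Vectors.dense([-0.01, -0.1]),),
           (Vectors.dense([0.9, 0.8]),), (Vectors.dense([0.75, 0.935]),),
           (Vectors.dense([-0.83, -0.68]),), (Vectors.dense([-0.91, -0.76]),)]
   
   # Default path: the partition count follows the default parallelism
   # (12 on a 12-thread laptop), which changes GMM's random initialization.
   print(spark.createDataFrame(data, ["features"]).rdd.getNumPartitions())
   
   # Pinning the partition count to 2 makes the partitioning, and hence
   # the doctest output, stable across machines.
   df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])
   print(df.rdd.getNumPartitions())  # 2
   ```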
   
   2. After using `df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])`, I can reproduce the 2.4.4 results. However, the doctests still fail. I logged the optimization metric `logLikelihood` after each iteration and found what looks like a sudden numeric divergence:
   
   | Iteration | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
   |---|---|---|---|---|---|---|---|---|---|---|
   | Master | -13.306466494963615 | -0.4307654468425961 | 0.49157579336057605 | 2.234212048899172 | 6.125367537295512 | 11.27762326533469 | 35.232285502171976 | 10.028821186214191 | 23.693392686726106 | 8.146360246481793 |
   | This PR | -13.306466494963615 | -0.430765446842597 | 0.4915757933605755 | 2.234212048899182 | 6.125367537295558 | 11.277623265335476 | 35.229680601767065 | 46.33491773124833 | 57.694248782061024 | 26.193922336279954 |
   
   The two runs track each other closely through iteration 6, then diverge sharply at iteration 7. I think that is acceptable, since the internal computation is complex.
   Moreover, the current convergence check `math.abs(logLikelihood - logLikelihoodPrev) > $(tol)` does not cope with a big hit to the optimization objective, e.g. `logLikelihood` dropping from 35.232285502171976 to 10.028821186214191 at iteration 7 on master.
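   
   One possible shape for a drop-aware check, as a hypothetical Python sketch (Spark's actual check is the Scala expression quoted above; `drop_tol` is an invented threshold):
   ```python
   def converged(ll_prev, ll_curr, tol):
       # Existing criterion: stop once the absolute change is within tol.
       return abs(ll_curr - ll_prev) <= tol
   
   def regressed(ll_prev, ll_curr, drop_tol):
       # Extra guard: the log-likelihood fell by more than drop_tol, i.e.
       # the update made the objective markedly worse, not converged.
       return ll_prev - ll_curr > drop_tol
   
   # Iter-6 -> iter-7 values from the Master row above:
   print(converged(35.232285502171976, 10.028821186214191, tol=0.001))   # False
   print(regressed(35.232285502171976, 10.028821186214191, drop_tol=1))  # True
   ```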
   
   So I think I need to:
   1. change the DataFrame generation logic to set the number of partitions explicitly (the current `createDataFrame` does not accept a partition count, so I need to create an RDD first);
   2. change the expected result in the doctest (I tend to set `maxIter=5` and expect a result around 11.27; see the sketch after this list);
   3. change the convergence check so a big drop in the optimization metric is handled (maybe in another PR, also checking other algorithms there).
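   
   Illustratively, items 1 and 2 together might look like this (reusing `spark`, `sc`, and `data` from the first sketch; `k=3` and `seed=10` are assumed to mirror the existing doctest, and 11.27 comes from the iteration-5 column of the table rather than a verified run):
   ```python
   from pyspark.ml.clustering import GaussianMixture
   
   # Pin the partition count, then cap the number of EM iterations at 5.
   df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])
   gm = GaussianMixture(k=3, tol=0.0001, maxIter=5, seed=10)
   model = gm.fit(df)
   print(round(model.summary.logLikelihood, 2))  # expected around 11.27
   ```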
   
   @srowen @huaxingao What do you think?
   
