zhengruifeng commented on issue #26735: [SPARK-30102][WIP][ML][PYSPARK] GMM supports instance weighting
URL: https://github.com/apache/spark/pull/26735#issuecomment-563044225

It really took me some days to look into the test failure.

1. In 2.4.4, I could not reproduce the doctest results:

```python
>>> summary.clusterSizes
[2, 2, 2]
>>> summary.logLikelihood
8.14636...
```

until I explicitly set the number of partitions to 2, like this: `df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])`. That is because the existing `df = spark.createDataFrame(data, ["features"])` creates a DataFrame with 12 partitions, and GMM is highly sensitive to the initialization, which depends on the partitioning. It also seems odd to me that on the Scala side `spark.createDataFrame` creates a DataFrame with 6 partitions. My laptop has an 8850 CPU with 6 cores and 12 threads. (A minimal repro sketch is appended at the end of this comment.)

2. After using `df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])`, I can reproduce the 2.4.4 results. However, the doctests still fail. I logged the optimization metric `logLikelihood` after each iteration and found what looks like a sudden numeric change (a driver-side sketch for observing this trajectory is also appended below):

| Iteration | Master | This PR |
| --- | --- | --- |
| 0 | -13.306466494963615 | -13.306466494963615 |
| 1 | -0.4307654468425961 | -0.430765446842597 |
| 2 | 0.49157579336057605 | 0.4915757933605755 |
| 3 | 2.234212048899172 | 2.234212048899182 |
| 4 | 6.125367537295512 | 6.125367537295558 |
| 5 | 11.27762326533469 | 11.277623265335476 |
| 6 | 35.232285502171976 | 35.229680601767065 |
| 7 | 10.028821186214191 | 46.33491773124833 |
| 8 | 23.693392686726106 | 57.694248782061024 |
| 9 | 8.146360246481793 | 26.193922336279954 |

The two branches stay close up to iteration 6, but a sudden numeric change happens at iteration 7. I think that is acceptable, since the internal computation is complex. Moreover, the current convergence check `math.abs(logLikelihood - logLikelihoodPrev) > $(tol)` does not work when the optimization objective takes a big hit, e.g. `logLikelihood` dropping from 35.232285502171976 to 10.028821186214191 at iteration 7 on master (a sketch of an alternative check is appended below as well).

So I think I need to:

1. change the df generation logic to explicitly set the number of partitions (the current `createDataFrame` does not support this input, so I need to create an RDD first);
2. change the expected result in the doctest (I tend to set `maxIter=5`, with result = 11.27);
3. change the convergence check to handle big drops in the optimization metric (maybe in another PR, and check other algorithms in it).

@srowen @huaxingao What do you think?
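For point 1, here is a minimal repro sketch of the partition-dependent initialization. It assumes a PySpark shell where `spark` and `sc` are already in scope; the data and parameter values mirror the GaussianMixture doctest from memory, so treat them as illustrative rather than exact:

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import GaussianMixture

# Illustrative 2-D points forming loose clusters (quoted from the doctest
# from memory -- not guaranteed to be the exact values).
data = [(Vectors.dense([-0.1, -0.05]),),
        (Vectors.dense([-0.01, -0.1]),),
        (Vectors.dense([0.9, 0.8]),),
        (Vectors.dense([0.75, 0.935]),),
        (Vectors.dense([-0.83, -0.68]),),
        (Vectors.dense([-0.91, -0.76]),)]

# Default path: the partition count follows the session's default
# parallelism (12 on a 6-core/12-thread laptop), and GMM's initialization
# -- and hence its result -- depends on that partitioning.
# df = spark.createDataFrame(data, ["features"])

# Explicit path: force 2 partitions to reproduce the 2.4.4 doctest result.
df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])

gm = GaussianMixture(k=3, tol=0.0001, maxIter=10, seed=10)
model = gm.fit(df)
print(model.summary.clusterSizes, model.summary.logLikelihood)
```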
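For the per-iteration trajectory in point 2: my numbers above came from logging inside the Scala optimizer, but a rough driver-side approximation is possible by refitting with a growing iteration cap. This is a sketch under two assumptions -- that the same seed retraces the same trajectory, and that `tol=0.0` effectively disables early convergence:

```python
# Approximate the per-iteration logLikelihood trajectory without patching
# Spark: refit with maxIter = 1, 2, ..., 10 and the same seed, so each
# run's final logLikelihood is (assumed to be) the value after that
# many iterations.
for it in range(1, 11):
    gm = GaussianMixture(k=3, tol=0.0, maxIter=it, seed=10)
    model = gm.fit(df)
    print(it, model.summary.logLikelihood)
```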
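For point 3, an illustrative Python sketch of what I mean about the convergence check. The real check lives in the Scala optimizer; `should_stop` and its stopping policy here are hypothetical, not an actual Spark API:

```python
# The current check, math.abs(logLikelihood - logLikelihoodPrev) > $(tol),
# treats a large *drop* in the objective the same as a large gain: both
# just read as "not converged yet". Since EM should not decrease the
# log-likelihood (in exact arithmetic), a drop like 35.23 -> 10.03 signals
# a numeric problem worth surfacing rather than iterating through.
def should_stop(log_likelihood, log_likelihood_prev, tol):
    improvement = log_likelihood - log_likelihood_prev
    if improvement < 0:
        # A drop indicates a numeric issue; stop (or warn/abort) instead
        # of treating it as ordinary non-convergence.
        return True
    return improvement <= tol
```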
