[
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231474#comment-14231474
]
Travis Galoppo commented on SPARK-4156:
---------------------------------------
Ok, I looked into this. The behavior is the result of using unit covariance
matrices for initialization: the numbers in the input files are quite large
and, more importantly (I suspect), vary by relatively large amounts, so the
initial unit covariance matrices are poor choices and drive the probabilities
to ~zero.
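A quick numeric illustration of the underflow (a Python sketch, not the MLlib code; the coordinates are made-up values on the S1 scale):

```python
import math

def unit_gaussian_density(x, mu):
    """Density of a multivariate normal with identity covariance."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-0.5 * sq) / (2 * math.pi) ** (d / 2)

# At S1-scale coordinates (~1e5), even a modest distance from the mean
# makes the exponent hugely negative, so the density underflows to 0.0
# and every point gets ~zero probability under every initial cluster.
big = unit_gaussian_density([600000.0, 550000.0], [600100.0, 550100.0])

# After dividing the same coordinates by 100000, the density is well away
# from zero, so the EM responsibilities stay numerically meaningful.
small = unit_gaussian_density([6.0, 5.5], [6.001, 5.501])

print(big, small)
```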
I tested the S1 dataset after scaling the inputs down by 100000, and the
algorithm yielded:
w=0.018651 mu=[1.4005351951422986,5.560161272092209] sigma=
0.0047916181666818325 1.8492627979416199E-4
1.8492627979416199E-4 0.011135224999325288
w=0.070139 mu=[3.9826648305512444,4.048416241679408] sigma=
0.08975122201635877 0.011161215961635662
0.011161215961635662 0.07281211382882091
w=0.203390 mu=[4.50966114011736,8.335671907946685] sigma=
3.343575502968182 0.16780915524083184
0.16780915524083184 0.1983579752119624
w=0.061357 mu=[8.243819479262187,7.299054596484072] sigma=
0.059502423358168244 -0.01288330287962225
-0.01288330287962225 0.08306975793088611
w=0.068116 mu=[3.2082470765623987,1.6153321811600052] sigma=
0.13661341675065408 -0.004671801905049122
-0.004671801905049122 0.1184668732856653
w=0.015480 mu=[6.032605151728542,5.76477595221249] sigma=
0.006257088363533114 -0.01541684245322017
-0.01541684245322017 0.11177862390275095
w=0.069246 mu=[8.599898790732793,5.47222558625928] sigma=
0.08334577559917022 0.0025980740480378017
0.0025980740480378017 0.10560039597455884
w=0.066601 mu=[1.675642401646793,3.4768887461230293] sigma=
0.06718419616465754 -0.001992742042064677
-0.001992742042064677 0.08394612669156842
w=0.050884 mu=[1.4034421425114039,5.586799889184816] sigma=
0.18839808914440148 -0.017016991559440697
-0.017016991559440697 0.09967868623594711
w=0.067257 mu=[6.180341749904763,3.9855165348399026] sigma=
0.11162501735542207 0.0023201319648720187
0.0023201319648720187 0.09177325542363057
w=0.070096 mu=[5.078726203553804,1.756463619639961] sigma=
0.07852242299631484 0.03291628699789406
0.03291628699789406 0.08050080528055803
w=0.015951 mu=[5.989248184898113,5.729903049835485] sigma=
0.06204977226748554 0.008716828781302866
0.008716828781302866 0.003116768910125245
w=0.128860 mu=[8.274797410035061,2.390551639925522] sigma=
0.10976751308928101 -0.186908554330941
-0.186908554330941 0.7759289399492513
w=0.065259 mu=[3.3783618332560876,5.622632293334024] sigma=
0.10109765051996433 0.0320694359617697
0.0320694359617697 0.03873645329222697
w=0.028714 mu=[6.146091367146795,5.732902319554125] sigma=
0.2389354399409953 0.023579597914199724
0.023579597914199724 0.1377941370353355
Multiplying the mu values back by 100000, they show pretty good fidelity to
the truth values in s1-cb.txt provided on the dataset's source website;
unfortunately, I do not see the original weight and covariance values used to
generate the data.
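The scale-then-rescale step above can be sketched as follows (a hypothetical illustration; the raw coordinates are made-up S1-like values, and the fitted mean is one of the mu vectors from the run above):

```python
# Divide raw coordinates by a scale factor before fitting, then multiply
# the fitted means back. Note the covariances rescale by SCALE**2, not SCALE.
SCALE = 100000.0

raw_points = [[606000.0, 574000.0], [320000.0, 161000.0]]  # made-up values
scaled = [[x / SCALE for x in p] for p in raw_points]

# ... fit the GMM on `scaled` ...
fitted_mu = [6.032605151728542, 5.76477595221249]  # a mu from the run above
recovered_mu = [m * SCALE for m in fitted_mu]
print(recovered_mu)  # means back on the original data scale
```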
Of course it would be easier to use if the scaling step were not necessary; I
can modify the cluster initialization to use a covariance estimated from a
sample and see how that works out. What strategy did you use for initializing
clusters in your implementation?
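One way the sample-based initialization could look (a minimal sketch of the proposed change, not the actual MLlib code; `sample_variance_init` is a hypothetical helper name):

```python
def sample_variance_init(points):
    """Per-dimension variance of a data sample, used as the diagonal of the
    initial covariance matrix instead of the identity. This adapts the
    initial covariance to the scale of the data automatically."""
    d = len(points[0])
    n = len(points)
    means = [sum(p[i] for p in points) / n for i in range(d)]
    return [sum((p[i] - means[i]) ** 2 for p in points) / n for i in range(d)]

pts = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
print(sample_variance_init(pts))  # per-dimension variances of the sample
```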
cc: [~MeethuMathew]
> Add expectation maximization for Gaussian mixture models to MLLib clustering
> ----------------------------------------------------------------------------
>
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Travis Galoppo
> Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for
> Gaussian mixture models
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)