David,
Have you looked at the K-means algorithm? It uses a similar two-phase
iteration to determine a clustering. In K-means you're looking for K
cluster centers such that, when each point is assigned to the nearest
center, the total of the distances from points to their assigned centers
is minimized. The general idea, then, is that you need to determine two
things: (1) cluster membership and (2) cluster locations. Solving for (1)
& (2) simultaneously is difficult, but by iteratively fixing (1) and
minimizing over (2), then fixing (2) and minimizing over (1), it can be
shown that you eventually converge to an optimum (a local optimum of the
cost, though not necessarily the global one). In this case the E-step
corresponds to assigning cluster membership (which cluster is the point
*expected* to lie in?) and the M-step corresponds to shifting the cluster
locations to minimize the cost function, *keeping the cluster assignments
constant*. Holding one set of quantities constant might seem like an
ad-hoc trick, but it's actually a very well-founded method.
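To make the two-phase iteration concrete, here's a minimal sketch in plain
NumPy. The names (kmeans, centers, labels) are my own for illustration, not
scikit-learn's KMeans internals, and it doesn't guard against a cluster
losing all its points, which a real implementation must:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # initialize cluster centers from k distinct data points
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # "E-like" step: fix the centers, assign each point to the nearest one
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # "M-like" step: fix the assignments, move each center to the
        # mean of the points assigned to it (this minimizes the cost)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # assignments stopped changing the centers: converged
        centers = new_centers
    return centers, labels
```

Each step can only decrease (or leave unchanged) the total squared
distance, which is why the alternation converges.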
GMM is a bit more complicated: in step (1), rather than a single cluster ID
for each point, you have an array of weights describing how much each
cluster "contributes" to each point. In step (2), rather than a set of
cluster locations alone, you have a set of cluster centers, a set of
cluster widths or covariance matrices (for N-dimensional Gaussians), and a
set of normalizations (the mixture weights). Despite the added complexity,
the E-M algorithm still works (which, frankly, still amazes me). In the E
step, you hold the cluster characteristics constant and compute the
per-point cluster weights (the "responsibilities"). In the M step, you fix
the responsibilities of each point and optimize the cost by changing the
cluster characteristics. The math (and therefore the code) is a bit more
opaque, but at each step it essentially comes down to maximizing a Gaussian
likelihood function while keeping either the per-point responsibilities or
the cluster characteristics constant.
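Here is a toy version of that loop in plain NumPy/SciPy. The names (gmm_em,
resp, nk) are my own, not the GMM module's internals, but the E step
mirrors the lpr / logsumexp lines you quoted from the code:

```python
import numpy as np
from scipy.special import logsumexp

def gmm_em(X, k, n_iter=50, seed=0):
    n, d = X.shape
    rng = np.random.RandomState(seed)
    means = X[rng.choice(n, k, replace=False)]               # cluster centers
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)    # cluster "widths"
    weights = np.full(k, 1.0 / k)                            # mixture weights
    for _ in range(n_iter):
        # E step: hold (means, covs, weights) fixed and compute
        # lpr[i, j] = log w_j + log N(x_i | mu_j, S_j)
        lpr = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            lpr[:, j] = (np.log(weights[j])
                         - 0.5 * (d * np.log(2 * np.pi) + logdet
                                  + np.einsum('ij,jk,ik->i', diff, inv, diff)))
        logprob = logsumexp(lpr, axis=1)        # per-sample log-likelihood
        resp = np.exp(lpr - logprob[:, None])   # responsibilities; rows sum to 1
        # M step: hold resp fixed and update the cluster characteristics
        # with the responsibility-weighted averages that maximize the likelihood
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = ((resp[:, j, None] * diff).T @ diff / nk[j]
                       + 1e-6 * np.eye(d))
    return weights, means, covs, logprob.sum()
```

The small 1e-6 ridge on the covariances is a common trick to keep them
invertible; the real implementation handles degeneracies more carefully.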
I hope that explanation is helpful...
Jake
On Sat, Sep 7, 2013 at 8:38 AM, David Reed <[email protected]> wrote:
> ok, this is what I can gather from the code:
>
> ================================
>
> Expectation Step
> --------------------------
> Calculate the log-likelihood and responsibilities for each sample.
> a. for each sample, the log-likelihood is calculated for each Gaussian
> and then summed across models (logprob = logsumexp(lpr, axis=1))
> b. responsibilities are just lpr, but normalized to 1 across each row?
> This figures out which Gaussian model is responsible for the
> log-likelihood of this sample, correct?
>
> Maximization Step
> ---------------------------
> Need help here; can't quite figure out what's going on.
>
> ================================
>
> As you can see the maximization step is where things fall apart for me.
> I've followed many tutorials, including this (
> https://www.youtube.com/watch?v=AnbiNaVp3eQ). They eventually involve
> sufficient statistics, etc. which seem to make some sense at some high
> level, but I can never connect that theory to code, in particular the code
> in the GMM module.
>
> Can someone help me fill in the blanks? Sorry if this is a little
> open-ended.
>
> -Dave
>
> On Tue, Sep 3, 2013 at 4:51 PM, Olivier Grisel
> <[email protected]>wrote:
>
>> 2013/9/3 David Reed <[email protected]>:
>> > Thanks Jake, I will look into this more. Is this the proper forum to
>> ask
>> > questions about its implementation if I am confused?
>>
>> Yes it is.
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>>
>> ------------------------------------------------------------------------------
>> Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
>> Discover the easy way to master current and previous Microsoft
>> technologies
>> and advance your career. Get an incredible 1,500+ hours of step-by-step
>> tutorial videos with LearnDevNow. Subscribe today and save!
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=58040911&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>