Hi Alex and group,

Alright, I've done this:
https://gist.github.com/chryss/3d708c66b5efdfe39741
https://dl.dropboxusercontent.com/u/372734/hfdi_debug.npz -- data file (4 data 
sets)

As these things go, while putting together the code for you to look at, the 
problem went away again, temporarily. Originally, I chose these four datasets 
because two gave me a nice bimodal fit with weights of approx. 50/50 (or 40/60), 
and two didn't (weights 99/1). Raising the threshold value to the default 
level of 0.01 made the problem go away for these two. But across my original 
120 datasets, I need to raise the threshold very far (0.5) for GMM to use both 
Gaussians. 

So my question has changed somewhat: how do I choose this value? I thought lowering 
it would force the model to seek a better fit, but the opposite seems to be 
taking place. I guess I am pretty unclear as to its meaning. The near-flat 
curves may also be driven by the outliers. 
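
For reference, a convergence threshold in EM is normally a stopping rule: the 
iterations end once the log-likelihood improves by less than the threshold. 
The sketch below illustrates that in plain Python. It is not scikit-learn's 
GMM code, and whether thresh behaves exactly this way there is an assumption. 
The function name fit_two_gaussians, the split-in-half initialisation and the 
variance floor are made up for illustration.

```python
import math
import random


def fit_two_gaussians(data, thresh=0.01, max_iter=1000):
    """Plain EM for a 1-D, two-component Gaussian mixture (needs >= 2 samples).

    Stops once the total log-likelihood changes by less than `thresh`.
    Returns (weights, means, variances, n_iterations).
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    # Crude initialisation (an assumption): split the sorted data in half.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    mean_all = sum(xs) / n
    var_all = max(sum((x - mean_all) ** 2 for x in xs) / n, 1e-6)
    variances = [var_all, var_all]
    weights = [0.5, 0.5]
    prev_ll = None
    for iteration in range(1, max_iter + 1):
        # E-step: responsibility of each component for each sample.
        resp = []
        ll = 0.0
        for x in xs:
            p = [
                w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = max(p[0] + p[1], 1e-300)  # guard against log(0)
            ll += math.log(total)
            resp.append((p[0] / total, p[1] / total))
        # Convergence test: this is the only place the threshold is used.
        if prev_ll is not None and abs(ll - prev_ll) < thresh:
            break
        prev_ll = ll
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # floor avoids a collapsing component
    return weights, means, variances, iteration


if __name__ == "__main__":
    random.seed(0)
    # Synthetic bimodal data in [0, 1]; not the hfdi data.
    demo = [random.gauss(0.3, 0.05) for _ in range(100)]
    demo += [random.gauss(0.7, 0.05) for _ in range(100)]
    for t in (0.5, 0.01, 1e-6):
        w, m, v, n_iter = fit_two_gaussians(demo, thresh=t)
        print(t, n_iter, w, m)
```

In this sketch a larger threshold can only stop EM earlier, never later, so 
the result with a high threshold stays closer to the initialisation. If GMM 
works the same way, that might explain why a high threshold keeps both 
Gaussians in play.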

Raising the threshold too far seems to degrade the quality of the fit for the 
datasets that work well -- and those are the ones I'm mainly interested in. 
So I could do a two-step process:
1. Fit ALL datasets with a very high threshold
2. Keep only the 20% best datasets (the ones with the least overlap/best 
separation between the two peaks) and re-fit with a lower threshold.
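
The two steps above could be sketched roughly as follows. This is only an 
outline under assumptions: `fit` stands for any callable that takes the 
samples and a threshold and returns (weights, means, variances), and the 
separation score (distance between the means in units of the pooled standard 
deviation) is a hypothetical choice, not something GMM provides.

```python
import math


def separation(weights, means, variances):
    """Hypothetical score: distance between the two means in units of the
    pooled standard deviation (larger means less overlap)."""
    pooled_sd = math.sqrt(0.5 * (variances[0] + variances[1]))
    return abs(means[0] - means[1]) / pooled_sd


def two_step_fit(datasets, fit, high_thresh=0.5, low_thresh=0.01,
                 keep_fraction=0.2):
    """datasets: dict mapping a name to its samples.
    fit: callable(samples, thresh) -> (weights, means, variances).
    Returns the low-threshold fits of the best-separated datasets."""
    # Step 1: fit ALL datasets with the high threshold and score each fit.
    scores = {}
    for name, samples in datasets.items():
        scores[name] = separation(*fit(samples, high_thresh))
    # Keep the best-separated fraction (at least one dataset).
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    best = sorted(scores, key=scores.get, reverse=True)[:n_keep]
    # Step 2: re-fit only the selected datasets with the lower threshold.
    return {name: fit(datasets[name], low_thresh) for name in best}
```

Both threshold values and the 20% cut-off are just the numbers from this 
message; they would need tuning on the real data.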

I'd be grateful for your input.

Thanks,

Chris


On May 14, 2014, at 4:53 PM, Alexandre Gramfort wrote:

> hi Chris,
> 
> you should share a gist on gist.github.com with a .npy containing the
> data to reproduce the problem.
> 
> Best,
> Alex
> 
> On Thu, May 15, 2014 at 2:15 AM, Chris Waigl <cwa...@alaska.edu> wrote:
>> Hi sklearn community,
>> 
>> I'm new on this list, Python user of many years, and maybe an advanced 
>> beginner with scikit-learn, which I've used for a previous project. I'll 
>> just jump in with my question.
>> 
>> I'm trying to use sklearn.mixture.GMM to fit (fairly) bimodal scalar data. 
>> The data values can theoretically vary between 0 and 1. They're represented 
>> as a float32 NumPy array. The scalar value is in fact calculated using a 
>> combination of two spectral bands (infrared remote sensing), and I'm trying 
>> to find the band combination that produces an index that best separates the 
>> two modes.
>> 
>> I find that for some band combinations (and therefore histograms), GMM very 
>> nicely fits two Gaussians. Example plots of very good fits:
>> https://dl.dropboxusercontent.com/u/372734/IMG/boundary_HFDI_GMM_193_216.png
>> https://dl.dropboxusercontent.com/u/372734/IMG/boundary_HFDI_GMM_191_219.png
>> 
>> Examples of bad fits (that is, one Gaussian dominates with a weight of 
>> approx. 99%, the other one is flat):
>> https://dl.dropboxusercontent.com/u/372734/IMG/boundary_HFDI_GMM_192_216.png
>> https://dl.dropboxusercontent.com/u/372734/IMG/boundary_HFDI_GMM_193_212.png
>> 
>> I'm calling the model as follows. The scalar index is called hfdi, and it 
>> lives on a 2D grid.
>> 
>>> from sklearn.mixture import GMM
>>> ...
>>> g = GMM(n_components=2)
>>> g.fit(hfdi.flatten())
>> 
>> g.converged_ nearly always returns True.
>> 
>> I also tried to play with some of the arguments:
>>> g = GMM(n_components=2, thresh=0.0001, n_init=5, n_iter=1000)
>> 
>> ... but with no improvement other than if I reduce the threshold too much I 
>> produce division-by-zero errors (I think).
>> 
>> I only have about 200 samples. Maybe that's not enough. Any advice?
>> 
>> Thanks,
>> 
>> Chris Waigl
>> 
>> --
>> Chris Waigl - cwa...@alaska.edu -  +1-907-474-5483 - Skype: cwaigl_work
>> Geophysical Institute, UAF, 903 Koyukuk Drive, Fairbanks, AK 99775-7320, USA
>> 
>> 
>> ------------------------------------------------------------------------------
>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>> Instantly run your Selenium tests across 300+ browser/OS combos.
>> Get unparalleled scalability from the best Selenium testing platform 
>> available
>> Simple to use. Nothing to install. Get started now for free."
>> http://p.sf.net/sfu/SauceLabs
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 

-- 
Chris Waigl - cwa...@alaska.edu -  +1-907-474-5483 - Skype: cwaigl_work
Geophysical Institute, UAF, 903 Koyukuk Drive, Fairbanks, AK 99775-7320, USA

