Hi Alex and group,

Alright, I've done this:

https://gist.github.com/chryss/3d708c66b5efdfe39741
https://dl.dropboxusercontent.com/u/372734/hfdi_debug.npz -- data file (4 data sets)

As these things go, while I was putting together the code for you to look at, the problem went away again, temporarily. Originally, I chose these four datasets because two gave me a nice bimodal fit with weights of approx. 50/50 (or 40/60), and two didn't (weights 99/1). Raising the threshold value to the default level of 0.01 made the problem go away for these two. But across my original 120 datasets, I need to raise the threshold very far (0.5) for GMM to use both Gaussians.

So my question has changed somewhat: how do I choose this value? I thought lowering it would force the model to seek a better fit, but the opposite seems to be happening. I guess I am pretty unclear as to its meaning. The near-flat curves may also be driven by the outliers.

Raising the threshold too far seems to degrade the quality of the fit for the datasets that work well, and those are the ones I'm mainly interested in. So I could do a two-step process:

1. Fit ALL datasets with a very high threshold.
2. Keep only the best 20% of datasets (the ones with the least overlap/best separation between the two peaks) and re-fit them with a lower threshold.

I'd be grateful for your input.

Thanks,
Chris

On May 14, 2014, at 4:53 PM, Alexandre Gramfort wrote:

> hi Chris,
>
> you should share a gist on gist.github.com with a .npy containing the
> data to reproduce the problem.
>
> Best,
> Alex
>
> On Thu, May 15, 2014 at 2:15 AM, Chris Waigl <cwa...@alaska.edu> wrote:
>> Hi sklearn community,
>>
>> I'm new on this list, a Python user of many years, and maybe an advanced
>> beginner with scikit-learn, which I've used for a previous project. I'll
>> just jump in with my question.
>>
>> I'm trying to use sklearn.mixture.GMM to fit (fairly) bimodal scalar data.
>> The data values can theoretically vary between 0 and 1. They're represented
>> as a float32 NumPy array.
>> The scalar value is in fact calculated using a
>> combination of two spectral bands (infrared remote sensing), and I'm trying
>> to find the band combination that produces an index that best separates the
>> two modes.
>>
>> I find that for some band combinations (and therefore histograms), GMM very
>> nicely fits two Gaussians. Example plots of very good fits:
>> https://dl.dropboxusercontent.com/u/372734/IMG/boundary_HFDI_GMM_193_216.png
>> https://dl.dropboxusercontent.com/u/372734/IMG/boundary_HFDI_GMM_191_219.png
>>
>> Examples of bad fits (that is, one Gaussian dominates with a weight of
>> approx. 99%, the other one is flat):
>> https://dl.dropboxusercontent.com/u/372734/IMG/boundary_HFDI_GMM_192_216.png
>> https://dl.dropboxusercontent.com/u/372734/IMG/boundary_HFDI_GMM_193_212.png
>>
>> I'm calling the model as follows. The scalar index is called hfdi, and it
>> lives on a 2D grid.
>>
>>> from sklearn.mixture import GMM
>>> ...
>>> g = GMM(n_components=2)
>>> g.fit(hfdi.flatten())
>>
>> g.converged_ nearly always returns True.
>>
>> I also tried to play with some of the arguments:
>>> g = GMM(n_components=2, thresh=0.0001, n_init=5, n_iter=1000)
>>
>> ... but with no improvement, except that if I reduce the threshold too much I
>> produce division-by-zero errors (I think).
>>
>> I only have about 200 samples. Maybe that's not enough. Any advice?
>>
>> Thanks,
>>
>> Chris Waigl
>>
>> --
>> Chris Waigl - cwa...@alaska.edu - +1-907-474-5483 - Skype: cwaigl_work
>> Geophysical Institute, UAF, 903 Koyukuk Drive, Fairbanks, AK 99775-7320, USA
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
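
P.S. To make the question about the threshold's meaning concrete, here is a minimal sketch of a two-component 1-D EM fit in plain Python. It is not scikit-learn's implementation. The helper fit_two_gaussians, its min_var variance floor, and the synthetic sample are all hypothetical, and I am assuming the threshold is compared against the gain in average log-likelihood per sample between iterations; the exact quantity scikit-learn uses may differ by version.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(data, thresh=0.01, max_iter=1000, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Hypothetical helper, not scikit-learn code. Iteration stops once the
    gain in average log-likelihood per sample drops below `thresh`.
    Returns (weights, means, variances, number of E-steps run).
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    # Crude initialisation: one component per half of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    overall_mean = sum(xs) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / n, min_var)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    prev_ll = None
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # E-step: responsibility of each component for each sample.
        resp = []
        ll = 0.0
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = max(p[0] + p[1], 1e-300)  # guard against underflow to zero
            ll += math.log(total)
            resp.append([p[0] / total, p[1] / total])
        ll /= n
        # Convergence test: this is the role the threshold plays here.
        if prev_ll is not None and ll - prev_ll < thresh:
            break
        prev_ll = ll
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, min_var)  # floor keeps the density finite
    return weights, means, variances, n_iter


# Synthetic bimodal sample on (0, 1); made-up values, not the HFDI data.
rng = random.Random(0)
sample = [rng.gauss(0.25, 0.05) for _ in range(100)]
sample += [rng.gauss(0.7, 0.08) for _ in range(100)]

for thresh in (0.5, 0.01, 0.0001):
    weights, means, variances, n_iter = fit_two_gaussians(sample, thresh=thresh)
    print(thresh, n_iter, [round(w, 2) for w in weights], [round(m, 2) for m in means])
```

In this sketch a larger thresh can only stop the same run earlier, so it returns a less converged estimate of the same fit rather than a different model. If the real GMM behaves similarly, a fit that shows two Gaussians only at a threshold of 0.5 may be an early snapshot that EM would later collapse towards the 99/1 solution.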