This could also be caused by a very diffuse prior, which makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha.
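To make that concrete, here is a tiny, generic illustration (this is not Mahout code; the class name and the numbers are invented) of the usual DP mixture assignment weights: an existing cluster is weighted by its size times the point's likelihood under that cluster's model, while a brand-new cluster is weighted by roughly alpha times the prior predictive density of the point. A very diffuse prior spreads its mass thinly, so that prior predictive density is tiny and only a much larger alpha keeps new clusters competitive:

  // Rough sketch only: generic DP mixture assignment weights, not Mahout's
  // implementation. The numeric values below are made up for illustration.
  public class NewClusterWeight {

    /** Unnormalized weight for opening a brand-new cluster for a point. */
    static double newClusterWeight(double alpha, double priorPredictive) {
      return alpha * priorPredictive;
    }

    /** Unnormalized weight for joining an existing cluster of a given size. */
    static double existingClusterWeight(int clusterSize, double likelihood) {
      return clusterSize * likelihood;
    }

    public static void main(String[] args) {
      double diffusePrior = 1e-8;  // prior predictive density under a very diffuse prior
      double tighterPrior = 1e-2;  // prior predictive density under a tighter prior
      for (double alpha : new double[] {1.0, 10.0, 100.0}) {
        System.out.printf("alpha=%6.1f  new-cluster weight: diffuse=%.2e  tighter=%.2e%n",
            alpha, newClusterWeight(alpha, diffusePrior), newClusterWeight(alpha, tighterPrior));
      }
      // For comparison, a modest existing cluster of 50 points with likelihood 1e-3:
      System.out.printf("existing cluster (n=50): %.2e%n", existingClusterWeight(50, 1e-3));
    }
  }

With numbers like these, the new-cluster weight under the diffuse prior stays orders of magnitude below the existing-cluster weight even for fairly large alpha, which would be consistent with everything collapsing into one model.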
I have had some half-formed thoughts about how to improve the mixing, and currently think that starting conditions may be the trick. Using something like k-means++ to initialize the clusters might help enormously (a rough sketch of that kind of seeding follows the quoted thread below).

On Tue, Feb 2, 2010 at 7:05 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> Just noticed this didn't go to the list.
>
> Hi Jerry,
>
> I'm not sure why Dirichlet is doing that with this dataset and have not
> been able to get better results than you. I have gotten excellent results
> using it with other models on other datasets, so I'm pretty confident in the
> core implementation. Because of its sampling nature, the algorithm is very
> difficult to debug, even using a fixed random number seed. My guess is that
> the model is not correct for high-dimensional points; likely the pdf()
> calculation is not differentiating them sufficiently for distinct clusters
> to remain. The computeParameters() method could also be wrong. I'm not at
> all confident I got the math right, and I doubt anybody has checked it.
>
> Unlike k-means, where k determines the number of clusters, Dirichlet needs a
> k value that is larger than the number of clusters you hope to identify.
> Increasing k will cause more computation, of course, but many of the
> resulting clusters will not have captured much of the total population and
> will be easily identified by that value. I'd try 10 or 20 and see what
> happens. I used a population threshold in
> /examples/src/main/java/org/apache/mahout/clustering/dirichlet/DisplayNDirichlet.java
> to only draw models with >5% of the population, with some success. Of
> course, 2d points can be easily visualized whereas synthetic control cannot.
>
> In each iteration, Dirichlet assigns each point to one of the models using a
> multinomial sampling of the pdf() * mixture probabilities. This means that
> points don't always get assigned to the same models, which makes determining
> cluster assignments rather fuzzy. In
> /utils/src/test/java/org/apache/mahout/clustering/dirichlet/TestL1ModelClustering.java
> I had some success with taking the last iteration's models and using them to
> partial-order the input dataset. Taking the 'count' number of points off the
> top of the list gave me reasonable results, though you can see by running it
> that the heuristic can be too picky and miss some near-misses.
>
> Hope this helps; perhaps Ted can offer some suggestions. He's the brains
> behind the implementation <grin>
>
> Jeff
>
>
> Jerry Ye wrote:
>
>> Hi Jeff,
>> I tried to run the example for Dirichlet process clustering on the
>> synthetic control data. I tried to set k = 5 for the number of clusters.
>> However, after the first state, I always end up with only a single cluster.
>> Essentially, for any number of iterations greater than 2, there is always
>> one model with all samples assigned to it. Any ideas?
>>
>> Additionally, what's the best way to figure out cluster assignments on a
>> test set given the model?
>>
>> Thanks.
>>
>> - jerry
>>
>> Command:
>> hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
>> org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job --maxIter 3 -k 5
>>
>> Output:
>> sample[0]=
>> sample[1]= m0(527)nm{n=527 m=[29.91, 30.99, 32.08, 32.48, 32.04, 31.36, 30.27, 29.05, 28.16, 28.16, 27.92, 28.85, 29.86, 30.73, 31.07, 31.45, 31.30, 30.83, 30.39, 29.85, 29.71, 29.03, 29.17, 29.65, 29.69, 30.03, 30.61, 30.53, 30.66, 30.61, 30.00, 30.29, 30.22, 30.33, 29.89, 30.14, 29.73, 29.69, 29.29, 29.63, 29.74, 29.90, 30.34, 30.33, 30.17, 30.68, 30.48, 30.14, 30.17, 29.84, 29.60, 29.42, 29.52, 29.58, 29.93, 30.23, 30.32, 30.08, 30.07, 30.19, ] sd=10.29},
>> m1(24)nm{n=24 m=[30.11, 29.08, 29.74, 30.78, 31.00, 29.19, 30.32, 29.58, 32.11, 28.30, 30.84, 31.58, 29.55, 31.57, 29.76, 28.74, 30.62, 30.77, 28.26, 30.34, 30.37, 29.99, 30.40, 29.25, 30.34, 29.44, 29.60, 30.14, 30.74, 28.89, 28.77, 30.09, 31.48, 29.67, 30.41, 29.71, 28.84, 30.94, 29.15, 30.52, 30.90, 30.81, 30.62, 29.03, 30.15, 28.98, 29.31, 30.89, 30.24, 30.93, 29.62, 29.40, 30.75, 29.96, 29.20, 29.42, 30.54, 28.96, 29.14, 30.91, ] sd=3.31},
>> m2(1)nm{n=1 m=[34.56, 35.50, 35.70, 24.69, 26.72, 29.56, 35.98, 33.22, 30.46, 34.71, 31.30, 27.79, 30.69, 29.01, 30.45, 26.65, 25.13, 24.33, 24.76, 29.19, 29.41, 34.65, 24.55, 34.29, 35.65, 27.11, 26.88, 24.27, 25.91, 33.84, 30.68, 35.34, 27.10, 30.66, 28.98, 32.05, 31.93, 25.44, 34.23, 31.35, 25.31, 34.49, 30.31, 25.37, 24.90, 28.54, 27.66, 28.28, 26.39, 32.40, 30.71, 24.88, 26.75, 26.43, 34.04, 25.96, 28.24, 26.45, 24.84, 32.17, ] sd=0.00},
>> m3(39)nm{n=39 m=[31.07, 31.11, 31.16, 30.63, 29.44, 31.00, 29.92, 29.87, 30.47, 30.17, 29.28, 29.11, 30.16, 29.14, 29.82, 29.62, 28.82, 29.92, 30.81, 30.39, 28.98, 30.23, 29.66, 30.29, 30.41, 30.36, 29.13, 29.93, 28.88, 29.86, 31.30, 29.80, 29.24, 29.69, 30.33, 29.70, 30.01, 30.54, 29.73, 28.76, 29.12, 31.08, 29.37, 29.95, 29.57, 30.29, 30.17, 29.68, 30.21, 30.50, 31.79, 31.14, 30.38, 29.46, 30.67, 30.98, 29.30, 29.95, 30.11, 29.89, ] sd=3.42},
>> m4(9)nm{n=9 m=[31.50, 29.89, 29.70, 29.40, 27.83, 27.77, 30.04, 30.11, 27.86, 29.77, 29.19, 30.88, 29.93, 28.00, 31.38, 29.89, 30.30, 28.39, 31.32, 30.12, 29.93, 29.82, 31.64, 30.74, 28.24, 27.82, 30.39, 30.53, 30.52, 31.72, 31.60, 28.06, 28.78, 28.15, 30.23, 31.21, 30.85, 26.93, 31.35, 30.22, 29.07, 28.78, 29.45, 30.95, 30.44, 29.21, 33.02, 28.94, 28.88, 26.63, 29.67, 33.16, 29.15, 28.66, 30.33, 30.65, 30.13, 31.19, 29.08, 27.55, ] sd=3.06},
>> sample[2]= m0(1127)nm{n=600 m=[30.02, 30.91, 31.90, 32.23, 31.76, 31.19, 30.26, 29.15, 28.46, 28.33, 28.15, 29.00, 29.87, 30.61, 30.94, 31.19, 31.09, 30.72, 30.33, 29.91, 29.69, 29.17, 29.28, 29.70, 29.75, 29.99, 30.46, 30.47, 30.54, 30.51, 30.06, 30.23, 30.18, 30.23, 29.94, 30.11, 29.73, 29.75, 29.35, 29.62, 29.73, 30.01, 30.28, 30.25, 30.13, 30.56, 30.45, 30.12, 30.15, 29.88, 29.75, 29.58, 29.61, 29.56, 29.96, 30.25, 30.26, 30.04, 30.01, 30.16, ] sd=9.74},
>
>

--
Ted Dunning, CTO DeepDyve
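As mentioned above, here is a minimal sketch of the kind of k-means++-style seeding I have in mind (this is not Mahout code; the class and method names are hypothetical). The first seed is drawn uniformly at random, and each subsequent seed is drawn with probability proportional to its squared distance from the nearest seed already chosen, so the initial models start out well separated:

  // Sketch of k-means++-style seeding for picking well-spread initial centers.
  // Not part of Mahout; class and method names are made up for illustration.
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  public class KMeansPlusPlusSeeding {

    static double squaredDistance(double[] a, double[] b) {
      double sum = 0.0;
      for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
      }
      return sum;
    }

    /**
     * Choose k seed points: the first uniformly at random, each subsequent
     * point with probability proportional to its squared distance from the
     * nearest seed chosen so far.
     */
    static List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
      List<double[]> seeds = new ArrayList<>();
      seeds.add(points.get(rng.nextInt(points.size())));
      while (seeds.size() < k) {
        double[] d2 = new double[points.size()];
        double total = 0.0;
        for (int i = 0; i < points.size(); i++) {
          double best = Double.MAX_VALUE;
          for (double[] seed : seeds) {
            best = Math.min(best, squaredDistance(points.get(i), seed));
          }
          d2[i] = best;
          total += best;
        }
        // Sample the next seed with probability proportional to d2[i];
        // points already chosen have d2 = 0 and so are effectively skipped.
        double r = rng.nextDouble() * total;
        int chosen = points.size() - 1;
        for (int i = 0; i < d2.length; i++) {
          r -= d2[i];
          if (r <= 0.0) {
            chosen = i;
            break;
          }
        }
        seeds.add(points.get(chosen));
      }
      return seeds;
    }

    public static void main(String[] args) {
      // Two well-separated blobs of synthetic 2-d points.
      List<double[]> points = new ArrayList<>();
      Random rng = new Random(42);
      for (int i = 0; i < 50; i++) {
        points.add(new double[] {rng.nextGaussian(), rng.nextGaussian()});
        points.add(new double[] {10 + rng.nextGaussian(), 10 + rng.nextGaussian()});
      }
      for (double[] seed : chooseSeeds(points, 2, rng)) {
        System.out.printf("seed: (%.2f, %.2f)%n", seed[0], seed[1]);
      }
    }
  }

One could then build the initial Dirichlet models around those seeds instead of sampling them from the prior; whether that actually improves mixing on the synthetic control data would have to be tested.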