Just notice this didn't go to the list.
--- Begin Message ---
Hi Jerry,
I'm not sure why Dirichlet is doing that with this dataset and have not
been able to get better results than you. I have gotten excellent
results using it with other models on other datasets, so I'm pretty
confident in the core implementation. Because of it's sampling nature,
the algorithm is very difficult to debug, even using a fixed random
number seed. My guess is that the model is not correct for large
dimensional points, likely the pdf() calculation is not differentiating
them sufficiently for distinct clusters to remain. The
computeParameters() method could also be wrong. I'm not at all confident
I got the math right and I doubt anybody has checked it.
Unlike k-means where k determines the number of clusters, Dirichlet
needs a k value which is larger than the number of clusters you hope to
identify. Increasing k will cause more computation to occur of course,
but many of the resulting clusters will not have captured much of the
total population and will be easily identified by that value. I'd try 10
or 20 and see what happens. I used a population threshold in
/examples/src/main/java/org/apache/mahout/clustering/dirichlet/DisplayNDirichlet.java
to only draw models with > 5% of the population with some success. Of
course, 2d points can be easily visualized whereas synthetic control cannot.
In each iteration Dirichlet assigns each point to one of the models
using a multinomial sampling of the pdf() * mixture probabilities. This
means that points don't always get assigned to the same models and it
makes determining cluster assignments rather fuzzy. In
/utils/src/test/java/org/apache/mahout/clustering/dirichlet/TestL1ModelClustering.java
I had some success with taking the last iteration's models and using
them to partial-order the input dataset. Taking the 'count' number of
points off the top of the list gave me reasonable results, though you
can see by running it that that heuristic can be too picky and miss some
near-misses.
Hope this helps, perhaps Ted can offer some suggestions. He's the brains
behind the implementation <grin>
Jeff
Jerry Ye wrote:
Hi Jeff,
I tried to run the example for dirichlet process clustering on the
synthetic control data. I tried to set k = 5 for the number of
clusters. However, after the first state, I always end up with only a
single cluster. Essentially, for any number of iterations greater than
2, there is always one model with all samples assigned to it. Any idea?
Additionally, what’s the best way to figure out cluster assignments on
a test set given the model?
Thanks.
- jerry
Command:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job --maxIter
3 -k 5
Output:
sample[0]=
sample[1]= m0(527)nm{n=527 m=[29.91, 30.99, 32.08, 32.48, 32.04,
31.36, 30.27, 29.05, 28.16, 28.16, 27.92, 28.85, 29.86, 30.73, 31.07,
31.45, 31.30, 30.83, 30.39, 29.85, 29.71, 29.03, 29.17, 29.65, 29.69,
30.03, 30.61, 30.53, 30.66, 30.61, 30.00, 30.29, 30.22, 30.33, 29.89,
30.14, 29.73, 29.69, 29.29, 29.63, 29.74, 29.90, 30.34, 30.33, 30.17,
30.68, 30.48, 30.14, 30.17, 29.84, 29.60, 29.42, 29.52, 29.58, 29.93,
30.23, 30.32, 30.08, 30.07, 30.19, ] sd=10.29}, m1(24)nm{n=24
m=[30.11, 29.08, 29.74, 30.78, 31.00, 29.19, 30.32, 29.58, 32.11,
28.30, 30.84, 31.58, 29.55, 31.57, 29.76, 28.74, 30.62, 30.77, 28.26,
30.34, 30.37, 29.99, 30.40, 29.25, 30.34, 29.44, 29.60, 30.14, 30.74,
28.89, 28.77, 30.09, 31.48, 29.67, 30.41, 29.71, 28.84, 30.94, 29.15,
30.52, 30.90, 30.81, 30.62, 29.03, 30.15, 28.98, 29.31, 30.89, 30.24,
30.93, 29.62, 29.40, 30.75, 29.96, 29.20, 29.42, 30.54, 28.96, 29.14,
30.91, ] sd=3.31}, m2(1)nm{n=1 m=[34.56, 35.50, 35.70, 24.69, 26.72,
29.56, 35.98, 33.22, 30.46, 34.71, 31.30, 27.79, 30.69, 29.01, 30.45,
26.65, 25.13, 24.33, 24.76, 29.19, 29.41, 34.65, 24.55, 34.29, 35.65,
27.11, 26.88, 24.27, 25.91, 33.84, 30.68, 35.34, 27.10, 30.66, 28.98,
32.05, 31.93, 25.44, 34.23, 31.35, 25.31, 34.49, 30.31, 25.37, 24.90,
28.54, 27.66, 28.28, 26.39, 32.40, 30.71, 24.88, 26.75, 26.43, 34.04,
25.96, 28.24, 26.45, 24.84, 32.17, ] sd=0.00}, m3(39)nm{n=39 m=[31.07,
31.11, 31.16, 30.63, 29.44, 31.00, 29.92, 29.87, 30.47, 30.17, 29.28,
29.11, 30.16, 29.14, 29.82, 29.62, 28.82, 29.92, 30.81, 30.39, 28.98,
30.23, 29.66, 30.29, 30.41, 30.36, 29.13, 29.93, 28.88, 29.86, 31.30,
29.80, 29.24, 29.69, 30.33, 29.70, 30.01, 30.54, 29.73, 28.76, 29.12,
31.08, 29.37, 29.95, 29.57, 30.29, 30.17, 29.68, 30.21, 30.50, 31.79,
31.14, 30.38, 29.46, 30.67, 30.98, 29.30, 29.95, 30.11, 29.89, ]
sd=3.42}, m4(9)nm{n=9 m=[31.50, 29.89, 29.70, 29.40, 27.83, 27.77,
30.04, 30.11, 27.86, 29.77, 29.19, 30.88, 29.93, 28.00, 31.38, 29.89,
30.30, 28.39, 31.32, 30.12, 29.93, 29.82, 31.64, 30.74, 28.24, 27.82,
30.39, 30.53, 30.52, 31.72, 31.60, 28.06, 28.78, 28.15, 30.23, 31.21,
30.85, 26.93, 31.35, 30.22, 29.07, 28.78, 29.45, 30.95, 30.44, 29.21,
33.02, 28.94, 28.88, 26.63, 29.67, 33.16, 29.15, 28.66, 30.33, 30.65,
30.13, 31.19, 29.08, 27.55, ] sd=3.06},
sample[2]= m0(1127)nm{n=600 m=[30.02, 30.91, 31.90, 32.23, 31.76,
31.19, 30.26, 29.15, 28.46, 28.33, 28.15, 29.00, 29.87, 30.61, 30.94,
31.19, 31.09, 30.72, 30.33, 29.91, 29.69, 29.17, 29.28, 29.70, 29.75,
29.99, 30.46, 30.47, 30.54, 30.51, 30.06, 30.23, 30.18, 30.23, 29.94,
30.11, 29.73, 29.75, 29.35, 29.62, 29.73, 30.01, 30.28, 30.25, 30.13,
30.56, 30.45, 30.12, 30.15, 29.88, 29.75, 29.58, 29.61, 29.56, 29.96,
30.25, 30.26, 30.04, 30.01, 30.16, ] sd=9.74},
--- End Message ---