This could also be caused by a very diffuse prior, which makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha.
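To make that concrete, here is a tiny, generic illustration (this is not Mahout code; the class name and the numbers are invented) of the usual DP mixture assignment weights: an existing cluster is weighted by its size times the point's likelihood under that cluster's model, while a brand-new cluster is weighted by roughly alpha times the prior predictive density of the point. A very diffuse prior spreads its mass thinly, so that prior predictive density is tiny and only a much larger alpha keeps new clusters competitive:

  // Rough sketch only: generic DP mixture assignment weights, not Mahout's
  // implementation. The numeric values below are made up for illustration.
  public class NewClusterWeight {

    /** Unnormalized weight for opening a brand-new cluster for a point. */
    static double newClusterWeight(double alpha, double priorPredictive) {
      return alpha * priorPredictive;
    }

    /** Unnormalized weight for joining an existing cluster of a given size. */
    static double existingClusterWeight(int clusterSize, double likelihood) {
      return clusterSize * likelihood;
    }

    public static void main(String[] args) {
      double diffusePrior = 1e-8;  // prior predictive density under a very diffuse prior
      double tighterPrior = 1e-2;  // prior predictive density under a tighter prior
      for (double alpha : new double[] {1.0, 10.0, 100.0}) {
        System.out.printf("alpha=%6.1f  new-cluster weight: diffuse=%.2e  tighter=%.2e%n",
            alpha, newClusterWeight(alpha, diffusePrior), newClusterWeight(alpha, tighterPrior));
      }
      // For comparison, a modest existing cluster of 50 points with likelihood 1e-3:
      System.out.printf("existing cluster (n=50): %.2e%n", existingClusterWeight(50, 1e-3));
    }
  }

With numbers like these, the new-cluster weight under the diffuse prior stays orders of magnitude below the existing-cluster weight even for fairly large alpha, which would be consistent with everything collapsing into one model.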
I have had some half-formed thoughts about how to improve the mixing, and currently think that starting conditions may be the trick. Using something like k-means++ to initialize the clusters might help enormously (a rough sketch of that kind of seeding follows the quoted thread below).

On Tue, Feb 2, 2010 at 7:05 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> Just noticed this didn't go to the list.
>
> Hi Jerry,
>
> I'm not sure why Dirichlet is doing that with this dataset and have not
> been able to get better results than you. I have gotten excellent results
> using it with other models on other datasets, so I'm pretty confident in the
> core implementation. Because of its sampling nature, the algorithm is very
> difficult to debug, even using a fixed random number seed. My guess is that
> the model is not correct for high-dimensional points; likely the pdf()
> calculation is not differentiating them sufficiently for distinct clusters
> to remain. The computeParameters() method could also be wrong. I'm not at
> all confident I got the math right, and I doubt anybody has checked it.
>
> Unlike k-means, where k determines the number of clusters, Dirichlet needs a
> k value that is larger than the number of clusters you hope to identify.
> Increasing k will cause more computation, of course, but many of the
> resulting clusters will not have captured much of the total population and
> will be easily identified by that value. I'd try 10 or 20 and see what
> happens. I used a population threshold in
> /examples/src/main/java/org/apache/mahout/clustering/dirichlet/DisplayNDirichlet.java
> to only draw models with >5% of the population, with some success. Of
> course, 2d points can be easily visualized whereas synthetic control cannot.
>
> In each iteration, Dirichlet assigns each point to one of the models using a
> multinomial sampling of the pdf() * mixture probabilities. This means that
> points don't always get assigned to the same models, which makes determining
> cluster assignments rather fuzzy. In
> /utils/src/test/java/org/apache/mahout/clustering/dirichlet/TestL1ModelClustering.java
> I had some success with taking the last iteration's models and using them to
> partial-order the input dataset. Taking the 'count' number of points off the
> top of the list gave me reasonable results, though you can see by running it
> that the heuristic can be too picky and miss some near-misses.
>
> Hope this helps; perhaps Ted can offer some suggestions. He's the brains
> behind the implementation <grin>
>
> Jeff
>
>
> Jerry Ye wrote:
>
>> Hi Jeff,
>> I tried to run the example for Dirichlet process clustering on the
>> synthetic control data. I tried to set k = 5 for the number of clusters.
>> However, after the first state, I always end up with only a single cluster.
>> Essentially, for any number of iterations greater than 2, there is always
>> one model with all samples assigned to it. Any ideas?
>>
>> Additionally, what's the best way to figure out cluster assignments on a
>> test set given the model?
>>
>> Thanks.
>>
>> - jerry
>>
>> Command:
>> hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
>> org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job --maxIter 3 -k 5
>>
>> Output:
>> sample[0]=
>> sample[1]= m0(527)nm{n=527 m=[29.91, 30.99, 32.08, 32.48, 32.04, 31.36, 30.27, 29.05, 28.16, 28.16, 27.92, 28.85, 29.86, 30.73, 31.07, 31.45, 31.30, 30.83, 30.39, 29.85, 29.71, 29.03, 29.17, 29.65, 29.69, 30.03, 30.61, 30.53, 30.66, 30.61, 30.00, 30.29, 30.22, 30.33, 29.89, 30.14, 29.73, 29.69, 29.29, 29.63, 29.74, 29.90, 30.34, 30.33, 30.17, 30.68, 30.48, 30.14, 30.17, 29.84, 29.60, 29.42, 29.52, 29.58, 29.93, 30.23, 30.32, 30.08, 30.07, 30.19, ] sd=10.29},
>> m1(24)nm{n=24 m=[30.11, 29.08, 29.74, 30.78, 31.00, 29.19, 30.32, 29.58, 32.11, 28.30, 30.84, 31.58, 29.55, 31.57, 29.76, 28.74, 30.62, 30.77, 28.26, 30.34, 30.37, 29.99, 30.40, 29.25, 30.34, 29.44, 29.60, 30.14, 30.74, 28.89, 28.77, 30.09, 31.48, 29.67, 30.41, 29.71, 28.84, 30.94, 29.15, 30.52, 30.90, 30.81, 30.62, 29.03, 30.15, 28.98, 29.31, 30.89, 30.24, 30.93, 29.62, 29.40, 30.75, 29.96, 29.20, 29.42, 30.54, 28.96, 29.14, 30.91, ] sd=3.31},
>> m2(1)nm{n=1 m=[34.56, 35.50, 35.70, 24.69, 26.72, 29.56, 35.98, 33.22, 30.46, 34.71, 31.30, 27.79, 30.69, 29.01, 30.45, 26.65, 25.13, 24.33, 24.76, 29.19, 29.41, 34.65, 24.55, 34.29, 35.65, 27.11, 26.88, 24.27, 25.91, 33.84, 30.68, 35.34, 27.10, 30.66, 28.98, 32.05, 31.93, 25.44, 34.23, 31.35, 25.31, 34.49, 30.31, 25.37, 24.90, 28.54, 27.66, 28.28, 26.39, 32.40, 30.71, 24.88, 26.75, 26.43, 34.04, 25.96, 28.24, 26.45, 24.84, 32.17, ] sd=0.00},
>> m3(39)nm{n=39 m=[31.07, 31.11, 31.16, 30.63, 29.44, 31.00, 29.92, 29.87, 30.47, 30.17, 29.28, 29.11, 30.16, 29.14, 29.82, 29.62, 28.82, 29.92, 30.81, 30.39, 28.98, 30.23, 29.66, 30.29, 30.41, 30.36, 29.13, 29.93, 28.88, 29.86, 31.30, 29.80, 29.24, 29.69, 30.33, 29.70, 30.01, 30.54, 29.73, 28.76, 29.12, 31.08, 29.37, 29.95, 29.57, 30.29, 30.17, 29.68, 30.21, 30.50, 31.79, 31.14, 30.38, 29.46, 30.67, 30.98, 29.30, 29.95, 30.11, 29.89, ] sd=3.42},
>> m4(9)nm{n=9 m=[31.50, 29.89, 29.70, 29.40, 27.83, 27.77, 30.04, 30.11, 27.86, 29.77, 29.19, 30.88, 29.93, 28.00, 31.38, 29.89, 30.30, 28.39, 31.32, 30.12, 29.93, 29.82, 31.64, 30.74, 28.24, 27.82, 30.39, 30.53, 30.52, 31.72, 31.60, 28.06, 28.78, 28.15, 30.23, 31.21, 30.85, 26.93, 31.35, 30.22, 29.07, 28.78, 29.45, 30.95, 30.44, 29.21, 33.02, 28.94, 28.88, 26.63, 29.67, 33.16, 29.15, 28.66, 30.33, 30.65, 30.13, 31.19, 29.08, 27.55, ] sd=3.06},
>> sample[2]= m0(1127)nm{n=600 m=[30.02, 30.91, 31.90, 32.23, 31.76, 31.19, 30.26, 29.15, 28.46, 28.33, 28.15, 29.00, 29.87, 30.61, 30.94, 31.19, 31.09, 30.72, 30.33, 29.91, 29.69, 29.17, 29.28, 29.70, 29.75, 29.99, 30.46, 30.47, 30.54, 30.51, 30.06, 30.23, 30.18, 30.23, 29.94, 30.11, 29.73, 29.75, 29.35, 29.62, 29.73, 30.01, 30.28, 30.25, 30.13, 30.56, 30.45, 30.12, 30.15, 29.88, 29.75, 29.58, 29.61, 29.56, 29.96, 30.25, 30.26, 30.04, 30.01, 30.16, ] sd=9.74},
>
>

--
Ted Dunning, CTO DeepDyve
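As mentioned above, here is a minimal sketch of the kind of k-means++-style seeding I have in mind (this is not Mahout code; the class and method names are hypothetical). The first seed is drawn uniformly at random, and each subsequent seed is drawn with probability proportional to its squared distance from the nearest seed already chosen, so the initial models start out well separated:

  // Sketch of k-means++-style seeding for picking well-spread initial centers.
  // Not part of Mahout; class and method names are made up for illustration.
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  public class KMeansPlusPlusSeeding {

    static double squaredDistance(double[] a, double[] b) {
      double sum = 0.0;
      for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
      }
      return sum;
    }

    /**
     * Choose k seed points: the first uniformly at random, each subsequent
     * point with probability proportional to its squared distance from the
     * nearest seed chosen so far.
     */
    static List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
      List<double[]> seeds = new ArrayList<>();
      seeds.add(points.get(rng.nextInt(points.size())));
      while (seeds.size() < k) {
        double[] d2 = new double[points.size()];
        double total = 0.0;
        for (int i = 0; i < points.size(); i++) {
          double best = Double.MAX_VALUE;
          for (double[] seed : seeds) {
            best = Math.min(best, squaredDistance(points.get(i), seed));
          }
          d2[i] = best;
          total += best;
        }
        // Sample the next seed with probability proportional to d2[i];
        // points already chosen have d2 = 0 and so are effectively skipped.
        double r = rng.nextDouble() * total;
        int chosen = points.size() - 1;
        for (int i = 0; i < d2.length; i++) {
          r -= d2[i];
          if (r <= 0.0) {
            chosen = i;
            break;
          }
        }
        seeds.add(points.get(chosen));
      }
      return seeds;
    }

    public static void main(String[] args) {
      // Two well-separated blobs of synthetic 2-d points.
      List<double[]> points = new ArrayList<>();
      Random rng = new Random(42);
      for (int i = 0; i < 50; i++) {
        points.add(new double[] {rng.nextGaussian(), rng.nextGaussian()});
        points.add(new double[] {10 + rng.nextGaussian(), 10 + rng.nextGaussian()});
      }
      for (double[] seed : chooseSeeds(points, 2, rng)) {
        System.out.printf("seed: (%.2f, %.2f)%n", seed[0], seed[1]);
      }
    }
  }

One could then build the initial Dirichlet models around those seeds instead of sampling them from the prior; whether that actually improves mixing on the synthetic control data would have to be tested.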