Bogdan Vatkov wrote:
Unfortunately, I am using private data which I cannot share. I am using emails, indexed by Solr, and then creating vectors out of them. I am using them with k-means and everything is OK. I just wanted to try out the Dirichlet algorithm.

On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <[email protected]> wrote:

I gather you are doing text clustering? Are you using one of our example datasets or one which is publicly available?

Bogdan Vatkov wrote:

Hi Jeff,

What kind of details do you need to continue? In the meantime I am going back to kmeans anyway (maybe I will really start by adding canopy to my kmeans-only scenario first ;)).

Best regards,
Bogdan

On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <[email protected]> wrote:

I think KMeans and Canopy are the most-used and therefore the most robust. Dirichlet still has not seen much use beyond some test examples, and NormalModel has at least one known problem (with sample() only returning the maximum likelihood) that has been reported but never fixed.

Can you point me to the problem you are running so I can try to get up to speed? It has been some time since I worked in this code, but I'm keen to do so and I have some time to invest.

Jeff

Bogdan Vatkov wrote:

But if I am the first one to use Dirichlet, which algorithm is the recommended one? Are all other algs better than Dirichlet so no one used it ;)?

On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[email protected]> wrote:

The NormalModelDistribution seems to still think all the data vectors are size=2. In sampleFromPrior, it is creating models with that size. Subsequently, when you calculate the pdf with your data value (x), the sizes are incompatible. I suggest changing 'DenseVector(2)' to 'DenseVector(n)', where n is your data cardinality. Please also look at the rest of the math in DenseVector with suspicion. AFAIK, you are the first person to try to use Dirichlet.
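The size mismatch Jeff describes can be reproduced in plain Java without the Mahout classes. The following is only a sketch under assumptions: the class CardinalitySketch and the methods dot, sampleMeansFromPrior and exponent are hypothetical stand-ins (double[] takes the place of DenseVector), not Mahout's API. It shows that a mean sized to the data cardinality n lets the pdf exponent quoted in this thread be evaluated, while a mean of size 2 fails at the dot product.

```java
class CardinalitySketch {

    /** Dot product on plain arrays; rejects vectors of different lengths. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Hypothetical stand-in for sampleFromPrior: zero means of size n instead of a fixed 2. */
    static double[][] sampleMeansFromPrior(int howMany, int n) {
        double[][] means = new double[howMany][];
        for (int i = 0; i < howMany; i++) {
            means[i] = new double[n]; // all zeros, like a fresh DenseVector(n)
        }
        return means;
    }

    /** Exponent from the pdf quoted in the thread: -(x.x - 2 x.mean + mean.mean) / (2 sd^2). */
    static double exponent(double[] x, double[] mean, double stdDev) {
        double sd2 = stdDev * stdDev;
        return -(dot(x, x) - 2 * dot(x, mean) + dot(mean, mean)) / (2 * sd2);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0}; // made-up data vector with cardinality 3

        // Mean sized to the data: the exponent can be computed.
        double[] mean = sampleMeansFromPrior(1, x.length)[0];
        System.out.println("exponent with matching sizes: " + exponent(x, mean, 1.0));

        // Mean hard-coded to size 2: the dot product rejects the pair.
        double[] badMean = sampleMeansFromPrior(1, 2)[0];
        try {
            exponent(x, badMean, 1.0);
        } catch (IllegalArgumentException e) {
            System.out.println("failed as expected: " + e.getMessage());
        }
    }
}
```

How n would actually reach the model distribution in Mahout (constructor argument, configuration, or reading the first input vector) is not settled in this thread, so the sketch simply passes it as a parameter.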
Bogdan Vatkov wrote:

Here is the stack at the point where the size of the mean vector is set to 2:

Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
    NormalModel.<init>(Vector, double) line: 48
    NormalModelDistribution.sampleFromPrior(int) line: 33
    DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
    DirichletDriver.createState(String, int, double) line: 172
    DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
    DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
    DirichletDriver.main(String[]) line: 109
    Clusters.doClustering() line: 244
    Clusters.access$0(Clusters) line: 175
    Clusters$1.run() line: 148
    Thread.run() line: 619

public class NormalModelDistribution implements ModelDistribution<Vector> {

  @Override
  public Model<Vector>[] sampleFromPrior(int howMany) {
    Model<Vector>[] result = new NormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      result[i] = new NormalModel(new DenseVector(2), 1);
    }
    return result;
  }

and later this vector is dotted in:

  @Override
  public double pdf(Vector x) {
    double sd2 = stdDev * stdDev;
    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
    double ex = Math.exp(exp);
    return ex / (stdDev * sqrt2pi);
  }

with the x vector, which is coming from the Hadoop MapRunner through the map function:

  public void map(WritableComparable<?> key, Vector v,
      OutputCollector<Text, Vector> output, Reporter reporter) throws IOException {

Any idea?

By the way, I am running Mahout 0.2. Should I move to 0.3 or to trunk? Is it safe enough to run against trunk?

On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[email protected]> wrote:

On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[email protected]> wrote:

Sorry, what does that mean :)?

It means that there is probably a programming bug somehow. At the very least, the program is not robust with respect to strange invocations.

what is a dotted vector?
and why aren't they the same?

The dot product is a vector operation that is the sum of products of corresponding elements of the two vectors being operated on. If these vectors don't have the same length, then it is an error.

what should I investigate?

I am not familiar with the code, but if I had time to look, my strategy would be to start in the NormalModel and work back up the stack trace to find out how the vectors came to be different lengths. No doubt, the code in NormalModel will not tell you anything, but you can see which vectors are involved, and by walking up the stack you may be able to see where they come from.

I am basically running my complete kmeans scenario (same input data, same number of clusters param, etc.) but just replacing the KmeansDriver.main step with a DirichletDriver.main call... of course the arguments are adjusted since kmeans and dirichlet do not have the same arguments.

I would think that this sounds very plausible.

I am not sure what number I should give for the alpha argument,

Alpha should have a value in the range from 0.01 to 20. I would scan with 1, 2, 5 magnitude steps to see what works well for your data (i.e. 0.01, 0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a fine place to start. The effect of different values should be small over a pretty wide range.

iterations and reductions... here is my current argument set:

args = new String[] {
    "--input", "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
    "--output", config.getClustersDir(),
    "--modelClass", "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
    "--maxIter", "15",
    "--alpha", "1.0",
    "--k", config.getClustersCount(),
    "--maxRed", "2"
};

Not off-hand.
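The alpha scan Ted describes (steps of 1, 2, 5 per order of magnitude, from 0.01 up to 20) could be generated along these lines. This is a minimal sketch: the class name AlphaScan and the method alphaGrid are made up for illustration, and running the driver once per value with the result as the "--alpha" argument is left out.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

class AlphaScan {

    /** Returns 0.01, 0.02, 0.05, 0.1, ... up to and including 20. */
    static List<Double> alphaGrid() {
        int[] mantissas = {1, 2, 5};
        double max = 20.0;
        List<Double> values = new ArrayList<>();
        // Powers of ten from 10^-2 to 10^1 cover the 0.01 to 20 range.
        for (int exp = -2; exp <= 1; exp++) {
            for (int m : mantissas) {
                // BigDecimal avoids rounding noise such as 0.020000000000000004.
                double alpha = new BigDecimal(m).scaleByPowerOfTen(exp).doubleValue();
                if (alpha <= max) {
                    values.add(alpha);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        for (double alpha : alphaGrid()) {
            // Each value would replace "1.0" after "--alpha" in the argument set.
            System.out.println("--alpha " + alpha);
        }
    }
}
```

Since the effect of alpha is expected to be small over a wide range, a coarse grid like this is probably enough for a first pass before refining around whichever value clusters best.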
