For what it's worth, I'm calculating the distance for every doc as I go through clusteredPoints. Since I'm not storing the results in memory this works OK, but it is a tad slower. My code is also now dependent on the distance measure, which it was not before.
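As a rough illustration of that workaround, here is a minimal sketch that recomputes a cosine distance for each point against its cluster center as the points stream by, without keeping the results. It uses plain arrays rather than Mahout vectors; the class name, the sample data, and the handling of zero vectors are assumptions made for the example, not Mahout's own implementation.

```java
public class CosineDistanceSketch {
    // Cosine distance taken as 1 - cosine similarity.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumption for this sketch: a zero vector is treated as orthogonal.
        if (normA == 0.0 || normB == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        double[] center = {0.5, 0.5, 0.0};                      // made-up cluster center
        double[][] points = {{1.0, 1.0, 0.0}, {0.0, 0.0, 1.0}}; // made-up documents
        for (double[] point : points) {
            // Computed on the fly and printed; nothing is stored.
            System.out.println(cosineDistance(point, center));
        }
    }
}
```

The cost is one extra distance evaluation per document, and the calling code has to know which measure the clustering run used.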

However this is not blocking me. Maybe it becomes a feature request--just something that got lost in refactoring?

On 6/29/12 12:48 PM, Jeff Eastman wrote:
Let's have this conversation for everybody on the list too

The pdf() of all DistanceMeasureClusters is:

  public double pdf(VectorWritable vw) {
    return 1 / (1 + measure.distance(vw.get(), getCenter()));
  }

For CosineDistance, the denominator should be distributed on 1..2, so the pdf values should fall in 0.5..1. Aha! If you look at AbstractClusteringPolicy.classify(), what is happening is that the pdf vector is being normalized:

  public Vector classify(Vector data, ClusterClassifier prior) {
    List<Cluster> models = prior.getModels();
    int i = 0;
    Vector pdfs = new DenseVector(models.size());
    for (Cluster model : models) {
      pdfs.set(i++, model.pdf(new VectorWritable(data)));
    }
    return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
  }

... and that will surely mess up the reverse distance calculation. Is there a way around this? Let me stew about it some...
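To make the effect concrete, here is a small standalone sketch (plain Java, no Mahout classes) that applies the same 1/(1+distance) form, normalizes by the sum as classify() does, and then tries the naive inverse on the normalized weight. The class name and the three distances are made up for illustration.

```java
import java.util.Arrays;

public class NormalizedPdfDemo {
    // Same form as the pdf() quoted above: 1 / (1 + distance).
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Divide each pdf by the sum of all pdfs, mirroring the zSum() scaling.
    static double[] normalize(double[] pdfs) {
        double sum = Arrays.stream(pdfs).sum();
        return Arrays.stream(pdfs).map(p -> p / sum).toArray();
    }

    // Naive inverse of pdf(); only meaningful for an un-normalized pdf.
    static double distanceFromPdf(double pdf) {
        return 1.0 / pdf - 1.0;
    }

    public static void main(String[] args) {
        double[] distances = {0.2, 0.5, 0.9}; // made-up distances to three clusters
        double[] pdfs = Arrays.stream(distances).map(NormalizedPdfDemo::pdf).toArray();
        double[] weights = normalize(pdfs);
        for (int i = 0; i < distances.length; i++) {
            System.out.printf("d=%.3f pdf=%.3f weight=%.3f recovered=%.3f%n",
                distances[i], pdfs[i], weights[i], distanceFromPdf(weights[i]));
        }
    }
}
```

The recovered values no longer match the original distances, because the normalized weight depends on the distances to all of the other clusters as well, and those are not available from a single weight.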

Jeff

On 6/29/12 3:27 PM, Pat Ferrel wrote:
Whoa, the 0.7 snapshot message below gave me an idea that I had some old artifacts in the path. Took them out and it IS working.

However, sorry if I'm being dense, but the formula given for pdf is pdf = 1/(1+distance). Unless I messed up my algebra, that means
distance = (1/pdf) - 1, which gives values impossible with cosine.

It almost looks like the weights below are 1 - distance, so distance = 1 - pdf?
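The algebra can be checked against the first weight in the dump that follows. This is only a sketch: the class name is hypothetical, and the upper bound of 2 assumes cosine distance is defined as 1 minus cosine similarity.

```java
public class WeightInversionCheck {
    // Inverse of pdf = 1 / (1 + distance).
    static double distanceFromPdf(double pdf) {
        return 1.0 / pdf - 1.0;
    }

    public static void main(String[] args) {
        double weight = 0.02946182601035338; // first weight in the seqdumper output
        double inverted = distanceFromPdf(weight);
        double oneMinus = 1.0 - weight;      // the alternative guess: distance = 1 - pdf
        // Assumed bound: 1 - cosine similarity cannot exceed 2.
        System.out.println("1/pdf - 1 = " + inverted + " (within cosine range: " + (inverted <= 2.0) + ")");
        System.out.println("1 - pdf   = " + oneMinus + " (within cosine range: " + (oneMinus <= 2.0) + ")");
    }
}
```

The inverted value lands far outside any cosine range, while 1 - pdf stays inside it, which is why the second reading looks tempting even though it does not follow from the pdf formula.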

Maclaurin:big-data pat$ mahout seqdumper -i b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
12/06/29 11:58:55 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[b2/kmeans-clusters/clusteredPoints/part-m-00000], --startPhase=[0], --tempDir=[temp]}
2012-06-29 11:58:55.449 java[27127:1903] Unable to load realm info from SCDynamicStore
Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 832: Value: 0.02946182601035338: http://farfetchers.com/ = [2223:0.729, 2862:0.501, 3573:0.467]
Key: 819: Value: 0.03323576094647134: http://farfetchers.com/blog = [1:0.034, 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034, 38:0.022, 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029, 72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 91:0.032, 104:0.034, 107:0.039, 112:0.034, 116:0.043, 121:0.017, 129:0.034, 136:0.035, 147:0.035, 148:0.031, 161:0.035,

On 6/29/12 11:08 AM, Pat Ferrel wrote:
Hmm, still the data in kmeans-clusters/clusteredPoints/part-m-00000 has all weights of 1.0

I checked to make sure the data was created with rebuilt code and that git knew the patched files were changed so the patch was included. I see the code in the IDE but I build with maven skipping tests. I looked through quite a few so can assume all are 1.0.

Maclaurin:big-data pat$ mahout seqdumper -i b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/06/29 10:58:16 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[b2/kmeans-clusters/clusteredPoints/part-m-00000], --startPhase=[0], --tempDir=[temp]}
2012-06-29 10:58:16.587 java[25768:1903] Unable to load realm info from SCDynamicStore
Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 792: Value: 1.0: http://farfetchers.com/ = [2223:0.729, 2862:0.501, 3573:0.467]
Key: 791: Value: 1.0: http://farfetchers.com/blog = [1:0.034, 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034, 38:0.022, 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029, 72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 91:0.032, 104:0.034, 107:0.039, 112:0.034, 1


On 6/29/12 10:06 AM, Jeff Eastman wrote:
You were correct, the documented weights were not being set. I just uploaded a much smaller patch that fixes that. Please let me know if that works for you.

Jeff

On 6/29/12 12:27 PM, Pat Ferrel wrote:
OK. It's actually in the docs, MiA at least, that it will be 1 or 0 (never 0 in kmeans since the 0 docs are dropped from clusteredPoints).

I mention the patch only because it would be easy enough to put the pdf in the properties there if I knew where to look for it.

On 6/29/12 9:21 AM, Jeff Eastman wrote:
Hmm, let me investigate this.


On 6/29/12 12:01 PM, Pat Ferrel wrote:

What is returned as the weight in the WeightedVectorWritable is pdfPerCluster.maxValue(), which is 1.0 for kmeans and so you cannot calculate the distance from this.

I'd fix this in the patch, but I don't know where to find the actual pdf for kmeans, since the one returned is rounded to 1 or 0.











