I just tried removing the normalization step and DisplayKMeans produces
exactly the same result. Since the pdfs vector is just an accumulation
of pdf values I think perhaps the normalization isn't necessary. The
only gotcha would be if a ClusterClassifier were ever used as an
AbstractVectorClassifier, since that API implies normalization (the
final value is 1-sum_of_scores). But ClusterClassifier returns the whole
vector so it really doesn't satisfy that API.
Does anybody care?
Can you try that? Does it give you realistic distances now?
e.g. return pdfs;
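
To illustrate why dropping the normalization should not change which cluster wins, here is a minimal standalone sketch. It does not use the Mahout classes; the class name NormalizationArgmax and the three distances are made up for illustration. Dividing every pdf by the same positive sum keeps their order, so the index of the largest entry stays the same.

```java
import java.util.Arrays;

// Hypothetical standalone sketch, not Mahout code.
final class NormalizationArgmax {
    // pdf as quoted in this thread: 1 / (1 + distance)
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Divide each entry by the sum of all entries, in the spirit of
    // pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum()).
    static double[] normalize(double[] pdfs) {
        double sum = Arrays.stream(pdfs).sum();
        return Arrays.stream(pdfs).map(p -> p / sum).toArray();
    }

    // Index of the largest value (first one wins on ties).
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] distances = {0.9, 0.2, 0.5}; // made-up distances to three clusters
        double[] pdfs = Arrays.stream(distances).map(NormalizationArgmax::pdf).toArray();
        double[] normalized = normalize(pdfs);
        System.out.println("closest cluster, raw pdfs:        " + argMax(pdfs));
        System.out.println("closest cluster, normalized pdfs: " + argMax(normalized));
    }
}
```

Both lines report the same cluster index, which would be consistent with DisplayKMeans drawing the same picture either way.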
On 6/29/12 3:48 PM, Jeff Eastman wrote:
Let's have this conversation for everybody on the list too.
The pdf() of all DistanceMeasureClusters is:
    public double pdf(VectorWritable vw) {
      return 1 / (1 + measure.distance(vw.get(), getCenter()));
    }
For CosineDistance with distances on 0..1, 1 + distance is distributed
on 1..2, so the pdf values should be distributed on 0.5..1. Aha!
If you look at AbstractClusteringPolicy.classify(), what is happening
is that the pdf vector is being normalized:
    public Vector classify(Vector data, ClusterClassifier prior) {
      List<Cluster> models = prior.getModels();
      int i = 0;
      Vector pdfs = new DenseVector(models.size());
      for (Cluster model : models) {
        pdfs.set(i++, model.pdf(new VectorWritable(data)));
      }
      return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
    }
... and that will surely mess up the reverse distance calculation. Is
there a way around this? Let me stew about it some...
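
A small sketch of the effect, assuming three made-up distances and using plain arrays instead of the Mahout Vector classes (the class name NormalizedInverse is hypothetical). Inverting the raw pdf recovers the distance, while inverting the normalized value does not, because the division by the sum is not undone.

```java
// Hypothetical standalone sketch, not Mahout code.
final class NormalizedInverse {
    // pdf as quoted above: 1 / (1 + distance)
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Algebraic inverse of pdf(): distance = 1 / pdf - 1
    static double distanceFromPdf(double pdf) {
        return 1.0 / pdf - 1.0;
    }

    // Compute the pdfs and divide each by their sum, as the quoted classify() does.
    static double[] normalizedPdfs(double[] distances) {
        double[] pdfs = new double[distances.length];
        double sum = 0.0;
        for (int i = 0; i < distances.length; i++) {
            pdfs[i] = pdf(distances[i]);
            sum += pdfs[i];
        }
        for (int i = 0; i < pdfs.length; i++) {
            pdfs[i] /= sum;
        }
        return pdfs;
    }

    public static void main(String[] args) {
        double[] distances = {0.2, 0.5, 0.9}; // made-up distances to three clusters
        double[] normalized = normalizedPdfs(distances);
        for (int i = 0; i < distances.length; i++) {
            System.out.printf("true=%.3f fromRaw=%.3f fromNormalized=%.3f%n",
                distances[i],
                distanceFromPdf(pdf(distances[i])),
                distanceFromPdf(normalized[i]));
        }
    }
}
```

With more clusters the sum grows, so the normalized values shrink and the "recovered" distances get larger still.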
Jeff
On 6/29/12 3:27 PM, Pat Ferrel wrote:
Whoa, the 0.7 snapshot message below gave me an idea that I had some
old artifacts in the path. Took them out and it IS working.
However, sorry if I'm being dense, but the formula given for pdf is
pdf = 1/(1+distance). Unless I messed up my algebra, that means
distance = (1/pdf) - 1, which gives values impossible with cosine.
It almost looks like the weights below are 1 - distance, so distance =
1 - pdf?
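
As a quick check of that algebra, here is a sketch that applies both candidate inversions to the two weights visible in the dump. The class and method names are made up, and the range check assumes cosine distance is 1 - cosine similarity, so it can never exceed 2.

```java
// Hypothetical standalone sketch, not Mahout code.
final class ObservedWeights {
    // Inverse of pdf = 1 / (1 + distance)
    static double invertReciprocal(double weight) {
        return 1.0 / weight - 1.0;
    }

    // Alternative guess: weight = 1 - distance
    static double invertComplement(double weight) {
        return 1.0 - weight;
    }

    // Assumption: cosine distance = 1 - cosine similarity, hence within [0, 2].
    static boolean plausibleCosineDistance(double distance) {
        return distance >= 0.0 && distance <= 2.0;
    }

    public static void main(String[] args) {
        // The two weights shown in the seqdumper output in this message.
        double[] weights = {0.02946182601035338, 0.03323576094647134};
        for (double w : weights) {
            double a = invertReciprocal(w);
            double b = invertComplement(w);
            System.out.printf("weight=%.6f  1/w-1=%.3f (plausible=%b)  1-w=%.3f (plausible=%b)%n",
                w, a, plausibleCosineDistance(a), b, plausibleCosineDistance(b));
        }
    }
}
```

The first inversion lands far outside the cosine range for both weights. The second stays in range, though that alone does not show it is the right formula.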
Maclaurin:big-data pat$ mahout seqdumper -i
b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
12/06/29 11:58:55 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647],
--input=[b2/kmeans-clusters/clusteredPoints/part-m-00000],
--startPhase=[0], --tempDir=[temp]}
2012-06-29 11:58:55.449 java[27127:1903] Unable to load realm info
from SCDynamicStore
Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 832: Value: 0.02946182601035338: http://farfetchers.com/ =
[2223:0.729, 2862:0.501, 3573:0.467]
Key: 819: Value: 0.03323576094647134: http://farfetchers.com/blog =
[1:0.034, 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034,
38:0.022, 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029,
72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 91:0.032,
104:0.034, 107:0.039, 112:0.034, 116:0.043, 121:0.017, 129:0.034,
136:0.035, 147:0.035, 148:0.031, 161:0.035,
On 6/29/12 11:08 AM, Pat Ferrel wrote:
Hmm, still the data in kmeans-clusters/clusteredPoints/part-m-00000
has all weights of 1.0
I checked to make sure the data was created with rebuilt code and
that git knew the patched files were changed so the patch was
included. I see the code in the IDE but I build with maven skipping
tests. I looked through quite a few so can assume all are 1.0.
Maclaurin:big-data pat$ mahout seqdumper -i
b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
12/06/29 10:58:16 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647],
--input=[b2/kmeans-clusters/clusteredPoints/part-m-00000],
--startPhase=[0], --tempDir=[temp]}
2012-06-29 10:58:16.587 java[25768:1903] Unable to load realm info
from SCDynamicStore
Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 792: Value: 1.0: http://farfetchers.com/ = [2223:0.729,
2862:0.501, 3573:0.467]
Key: 791: Value: 1.0: http://farfetchers.com/blog = [1:0.034,
9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034, 38:0.022,
39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029,
72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033,
91:0.032, 104:0.034, 107:0.039, 112:0.034, 1
On 6/29/12 10:06 AM, Jeff Eastman wrote:
You were correct, the documented weights were not being set. I just
uploaded a much smaller patch that fixes that. Please let me know
if that works for you.
Jeff
On 6/29/12 12:27 PM, Pat Ferrel wrote:
OK. It's actually in the docs, MiA at least, that it will be 1 or
0 (never 0 in kmeans since the 0 docs are dropped from
clusteredPoints).
I mention the patch only because it would be easy enough to put
the pdf in the properties there if I knew where to look for it.
On 6/29/12 9:21 AM, Jeff Eastman wrote:
Hmm, let me investigate this.
On 6/29/12 12:01 PM, Pat Ferrel wrote:
What is returned as the weight in the WeightedVectorWritable is
pdfPerCluster.maxValue(), which is 1.0 for kmeans and so you
cannot calculate the distance from this.
I'd fix this in the patch but I don't know where to find the
actual pdf for kmeans, since the one returned is rounded to 1
or 0.