+ [email protected] so the conversation will be visible to others.
Hi Pat,
I'm at a loss as to how the inter-cluster density can be zero. The tests
that were producing those values have been fixed. Does the
ClusterEvaluator produce zero with your data too? If so, let's debug
that one first as it shares the representative points computation and is
a lot easier to debug.
How many representative points are you computing? Have you inspected
them to see if they look ok? There are routines in the two evaluator
unit tests that will print them out and we can make them public static
if it will help. Since they are identical it might also make sense to
move them into a public utility. I will do that if you think it will be
useful.
-------- Original Message --------
Subject: Re: [jira] [Commented] (MAHOUT-1020) The Cluster Evaluator is
returning bad results
Date: Fri, 01 Jun 2012 09:43:45 -0700
From: Pat Ferrel <[email protected]>
To: Jeff Eastman <[email protected]>
It is always 0 on any data set I've tried, even when no pruning is
reported. The debug output I sent you had no reported pruning as I
recall. But again I'm on 0.6, upgrading as we write...
On 6/1/12 9:35 AM, Jeff Eastman wrote:
I don't understand the inter-cluster density = 0. The tests that were
producing those values were in error and they now produce reasonable
looking densities. Have you taken a look at the representative points
produced from your clusters? If they are all the same then pruning
will occur and you might end up with nothing left to evaluate.
On 6/1/12 12:28 PM, Pat Ferrel wrote:
Sure, it is attached. It iterates through a small data set of 228
docs and 3-7 clusters with kmeans. The results are the output of both
evaluators on the resulting clusters. Still on 0.6 I'm afraid.
How about the CDbw output of inter-cluster distance always = 0.0? I
understand that it is an important measure.
On 6/1/12 9:23 AM, Jeff Eastman wrote:
This patch fixed a problem with the unit test that was causing
kmeans and fuzzyk tests to fail. It did not change any of the CDbw
evaluation code, which now seems to produce reasonable results for
all tests. I also just fixed the same problem in the ClusterEvaluator.
I can't seem to find the debug output you mention. Can you please
repost it?
On 6/1/12 11:12 AM, Pat Ferrel wrote:
The representative point calc is used in the general case too, not
just the test case. And didn't you say that bad representative
points lead to having the cluster pruned? Also does this fix the
inter-cluster distance always = 0?
I do need to move to trunk I suppose, then I will test.
BTW did the debug output I sent you look like reasonable results?
On 6/1/12 8:05 AM, Hudson (JIRA) wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287466#comment-13287466
]
Hudson commented on MAHOUT-1020:
--------------------------------
Integrated in Mahout-Quality #1509 (See
[https://builds.apache.org/job/Mahout-Quality/1509/])
MAHOUT-1020: fixed path names for testKmeans and
testFuzzyKmeans that were causing representative points
calculation to fail. CDbw results now look more reasonable.
(Revision 1345214)
Result = FAILURE
jeastman :
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345214
Files :
*
/mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/cdbw/TestCDbwEvaluator.java
The Cluster Evaluator is returning bad results
----------------------------------------------
Key: MAHOUT-1020
URL:
https://issues.apache.org/jira/browse/MAHOUT-1020
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.6
Environment: Various environments and data sets. Mahout
0.6, 0.7 trunk not tested.
Reporter: Pat Ferrel
Assignee: Jeff Eastman
Fix For: 0.7
Conversation between Pat Ferrel and Jeff Eastman on the user
list
Hi Pat,
I don't have a good answer here. Evidently, something in CDbw has
become broken and you are the first to notice. When I run
TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly
incorrect. The values for Canopy, MeanShift and Dirichlet are not
so obviously incorrect but I remain suspicious. Something must
have become broken in the recent clustering refactoring.
From the method CDbwEvaluator.invalidCluster comment (used to
enable pruning):
* Return if the cluster is valid. Valid clusters must have more than 2 representative points,
* and at least one of them must be different than the cluster center. This is because the
* representative points extraction will duplicate the cluster center if it is empty.
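For readers following the thread, the rule in that comment can be sketched in plain Java as below. This is only an illustration of the rule as quoted, not the Mahout source: the class name `RepresentativePointCheck`, the method `isValid`, and the use of `double[]` for points are all assumptions made for the example, and the "more than 2" threshold is taken literally from the comment.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: illustrates the validity rule
// described in the quoted comment using plain double[] vectors.
final class RepresentativePointCheck {

  private RepresentativePointCheck() {}

  /** Returns true when the cluster would survive pruning under the quoted rule. */
  static boolean isValid(double[] center, List<double[]> representativePoints) {
    // Rule as quoted: more than 2 representative points are required.
    if (representativePoints.size() <= 2) {
      return false;
    }
    // At least one representative point must differ from the cluster center.
    for (double[] point : representativePoints) {
      if (!Arrays.equals(point, center)) {
        return true;
      }
    }
    // Every point duplicates the center, so the cluster is treated as empty.
    return false;
  }

  public static void main(String[] args) {
    double[] center = {1.0, 1.0};
    List<double[]> duplicates = Arrays.asList(
        new double[] {1.0, 1.0}, new double[] {1.0, 1.0}, new double[] {1.0, 1.0});
    List<double[]> spread = Arrays.asList(
        new double[] {1.0, 1.0}, new double[] {2.0, 0.5}, new double[] {0.0, 1.5});
    System.out.println(isValid(center, duplicates)); // false
    System.out.println(isValid(center, spread));     // true
  }
}
```

Under this reading, a cluster whose representative points all equal its center is pruned, which is one way every cluster could drop out and leave nothing to evaluate.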
Oddly enough, inspection of the test log indicates that only
k-means and fuzzy-k are not pruning clusters. Clearly some more
investigation is needed. I will take a look at it tomorrow. In
the meantime, if you develop any additional insight, please do
share it with us.
Thanks,
Jeff
On 5/17/12 3:53 PM, Pat Ferrel wrote:
I built a tool that iterates through a list of values for k on
the same data and spits out the CDbw and ClusterEvaluator
results each time.
When the evaluator or CDbw prunes a cluster, how do I interpret
that? They seem to throw out the same clusters on a given run.
Also CDbw always returns an inter-cluster density of 0?