[
https://issues.apache.org/jira/browse/MAHOUT-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881974#comment-13881974
]
Pat Ferrel commented on MAHOUT-1410:
------------------------------------
Looks like you changed the distance-squared to distance, which is fine by me
since it's labeled correctly.
I get a failure in the Mahout build tests, but I generally do, so I've been
building without tests. This has been happening since 0.8, so it is probably my
own problem and not a real test failure. Some dump info is listed below.
When I build without tests the patch works perfectly on kmeans and fuzzykmeans.
These were run using the drivers, so in mapreduce mode. I checked the input to
see if the vectors match the ones in clusteredPoints, and indeed they do. So I'd
call that confirmation.
Questions I'd ask in a review:
1) Are you changing the type of the vector in clusteredPoints to NamedVector or
was it always that way with a blank name?
2) If the class is changing are we sure that doesn't mess things up when
actually using a named vector as clustering input?
3) Not sure how to test "sequential" clustering but if you tell me what you
mean I can test that too.
This is fixed for me.
pat$ mahout seqdumper -i ~/big-data/guide/temp/tmp1/pairwiseSimilarity/ | more
Input Path: file:/Users/pat/big-data/guide/temp/tmp1/pairwiseSimilarity/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value:
{5:0.854718643737767,172:0.8292121703264371,192:0.854718643737767,13:0.8962379566075429,198:0.8962379566075429,20:0.8962379566075429,9:0.8962379566075429,19:0.8962379566075429,201:0.8962379566075429,207:0.8962379566075429,43:0.6976395601234899,193:0.8962379566075429,18:0.8962379566075429,184:0.8962379566075429,209:0.8962379566075429,187:0.854718643737767,2:0.8962379566075429,211:0.854718643737767,27:0.7327087555023397,177:0.854718643737767,183:0.8292121703264371,214:0.8962379566075429,17:0.854718643737767,190:0.854718643737767,176:0.8292121703264371,12:0.8962379566075429,191:0.854718643737767}
The above is what I'm clustering, so the input, and below is the
clusteredPoints, so the output:
pat$ mahout seqdumper -i /Users/pat/big-data/guide/temp/clusters/clusteredPoints/part-m-00000 | more
Key: 593: Value: wt: 1.0 distance: 4.487859968341614 vec: 0 = [2:0.896,
5:0.855, 9:0.896, 12:0.896, 13:0.896, 17:0.855, 18:0.896, 19:0.896, 20:0.896,
27:0.733, 43:0.698, 172:0.829, 176:0.829, 177:0.855, 183:0.829, 184:0.896,
187:0.855, 190:0.855, 191:0.855, 192:0.855, 193:0.896, 198:0.896, 201:0.896,
207:0.896, 209:0.896, 211:0.855, 214:0.896]
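
For anyone who wants to repeat the input/output comparison without eyeballing
it, here is a rough sketch in Python. It assumes the seqdumper text looks like
the two dumps above, and that clusteredPoints prints values rounded to three
decimals (inferred from the dump, not checked against the code). The helper
names parse_sparse and same_vector are made up for this example, and the sample
lines hold only a few entries copied from the dumps, not the whole vector.

```python
import re

# Matches "index:value" pairs such as "5:0.854718643737767" or "5:0.855".
PAIR = re.compile(r"(\d+):(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_sparse(text):
    """Return {index: value} for the pairs inside the bracketed part of text."""
    match = re.search(r"[\[{](.*)[\]}]", text, re.S)
    body = match.group(1) if match else text
    return {int(i): float(v) for i, v in PAIR.findall(body)}


def same_vector(full, printed, places=3):
    """True if both vectors have the same indices and every printed value is
    within half a unit of the last printed decimal place of the full value."""
    if full.keys() != printed.keys():
        return False
    tol = 0.5 * 10 ** -places + 1e-12
    return all(abs(full[k] - printed[k]) <= tol for k in full)


# A few entries copied from the two dumps above (assumption: a subset is
# enough to illustrate the check).
input_line = (
    "{5:0.854718643737767,172:0.8292121703264371,43:0.6976395601234899,"
    "27:0.7327087555023397,2:0.8962379566075429}"
)
output_line = (
    "Key: 593: Value: wt: 1.0 distance: 4.487859968341614 vec: 0 = "
    "[2:0.896, 5:0.855, 27:0.733, 43:0.698, 172:0.829]"
)

print(same_vector(parse_sparse(input_line), parse_sparse(output_line)))
```

The comparison is order-independent, which matters here because the input dump
lists indices in hash order while clusteredPoints lists them sorted.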
Build Test Failure Results :
Failed tests:
ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:102->assertVectorsWithOutlierRemoval:188->checkClustersWithOutlierRemoval:238->Assert.assertTrue:41->Assert.fail:88
not expecting cluster:0:{0:1.0,1:1.0}
Tests run: 761, Failures: 1, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ................................ SUCCESS [2.347s]
[INFO] Apache Mahout ..................................... SUCCESS [0.519s]
[INFO] Mahout Math ....................................... SUCCESS [1:18.563s]
[INFO] Mahout Core ....................................... FAILURE [5:50.906s]
[INFO] Mahout Integration ................................ SKIPPED
[INFO] Mahout Examples ................................... SKIPPED
[INFO] Mahout Release Package ............................ SKIPPED
[INFO] Mahout Math/Scala wrappers ........................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7:12.848s
[INFO] Finished at: Sat Jan 25 08:45:56 PST 2014
[INFO] Final Memory: 28M/554M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on
project mahout-core: There are test failures.
> clusteredPoints do not contain a vector id
> ------------------------------------------
>
> Key: MAHOUT-1410
> URL: https://issues.apache.org/jira/browse/MAHOUT-1410
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.8
> Environment: using 0.9 release candidate
> Reporter: Pat Ferrel
> Assignee: Suneel Marthi
> Fix For: 0.9
>
> Attachments: MAHOUT-1410.patch, MAHOUT-1410.patch
>
>
> When clustering non-named vectors there are no vector ids in clusteredPoints
> so the other values there, cluster id, vector values, distance-squared, pdf,
> cannot be tied to any known vector.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)