Hello,
I would like to do some clustering with Hadoop, and I found Mahout. I am
really impressed, but as a newbie I got stuck and have several questions. The
goal is string clustering: my resources have property values expressed as
strings, and I would like to group the resources by those values. I use Eclipse
as my IDE and have two working Mahout projects, one with the release version
(0.1) and the other with the SVN version. I can compile the examples and run
them on my own Hadoop cluster. I have focused on the Synthetic Control Data
example with the Canopy algorithm because it is similar to my problem.
- With the release version and the default parameter values, all the items end
up in the same cluster (C1). Is that normal?
- With the SVN version I don't get readable output because no OutputDriver is
implemented. If I use the one from the release version, I get exceptions (I
think the format has changed between releases, for example using the '{'
symbol instead of '[').
- I use string values instead of double values. I have implemented my own
string distance that returns a double when the parameters are strings, but I
think Mahout Vectors are implemented only to store double values. Is there any
way to use string values?
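
To make the last question concrete, here is a minimal sketch of what I am
trying to achieve, written in plain Java without any Mahout classes. Everything
in it is an assumption for illustration: the class name StringCanopy, the
method names, and the use of Levenshtein edit distance as a stand-in for my own
string distance. It is a single-machine version of the canopy idea (loose
threshold t1, tight threshold t2), not the Mahout implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper, not part of Mahout: canopy clustering directly on strings.
final class StringCanopy {

    // Levenshtein edit distance, used here as a stand-in for a custom string distance.
    static double distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building the first j characters of b from ""
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Canopy clustering: t1 is the loose threshold, t2 the tight one (t1 > t2).
    static List<List<String>> cluster(List<String> values, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<List<String>> canopies = new ArrayList<>();
        LinkedList<String> remaining = new LinkedList<>(values);
        while (!remaining.isEmpty()) {
            // The next remaining value becomes the center of a new canopy.
            String center = remaining.removeFirst();
            List<String> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<String> it = remaining.iterator();
            while (it.hasNext()) {
                String candidate = it.next();
                double d = distance(center, candidate);
                if (d < t1) {
                    canopy.add(candidate); // within the loose threshold: joins this canopy
                }
                if (d < t2) {
                    it.remove(); // within the tight threshold: cannot start another canopy
                }
            }
            canopies.add(canopy);
        }
        return canopies;
    }
}
```

For example, StringCanopy.cluster(Arrays.asList("apple", "apply", "ample",
"banana", "bandana"), 3, 2) groups the first three values into one canopy and
the last two into another. If both thresholds are larger than every pairwise
distance, all the values fall into a single canopy, which makes me wonder
whether the default thresholds explain my first question.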
I would be very grateful if anyone could help me.
Thank you very much in advance.
Regards,
jfcg