Sorry for the delay. One simplified example could be the following. Values:
Rolling Stone; Organisation
The Rolling Stones; MusicGroups
Like a Rolling Stone; MusicSongs
A Rolling Stone; MusicSongs
Rolling Stone; Magazine

A sample distance metric could be Levenshtein distance. So, between the first item and the following ones, the distances would be 13, 12.2, 10.19, 9. And exactly the same for the following items. The idea is that if we suppose 3 clusters, I expect to have item 1 in Cluster 1, items 2-3-4 in Cluster 2 and item 5 in Cluster 3.

I hope this clarifies things a little. I don't know the algorithm deeply, so I don't know whether the numerical values have any importance apart from the distance computation. If not, I think that the idea of Mapping could be enough for our purposes. Could you give me some more information, or tell me where to start reading?

Thank you very much.

Regards,
jfcg

> Date: Fri, 28 Aug 2009 11:09:57 -0700
> From: [email protected]
> To: [email protected]
> Subject: Re: String clustering and other newbie questions
>
> Well, all of the clustering code is based upon clustering points in an
> n-dimensional vector space and all of the APIs operate upon Vectors. We
> do support the ability to attach a label binding Map to a Vector which
> can map Strings into integer index values. Once this has been done you
> can access the vector values symbolically. I'm not sure this will help
> with your problem and you may need to write your own Canopy.
>
> If you can post some examples of the values you wish to cluster and
> something of your distance measure then I will see if I can figure out a
> way to help you further.
>
> Jeff
>
> Juan Francisco Contreras Gaitan wrote:
> > Thank you so much for your quick reply.
> >
> > Unfortunately, I'm afraid that there is no way of massaging my strings into
> > doubles, because the distance measure would make no sense in terms of
> > doubles. Could you please give me some clue to write the required code in
> > order to solve this difficulty?
> >
> > Thank you very much again.
> >
> > Regards,
> > jfcg
> >
> >> Date: Fri, 28 Aug 2009 08:49:38 -0700
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: Re: String clustering and other newbie questions
> >>
> >> Juan Francisco Contreras Gaitan wrote:
> >>
> >>> Hello,
> >>>
> >>> I would like to do some clustering by using Hadoop and I found Mahout. I
> >>> am really impressed, but as a newbie I got stuck and I have several
> >>> questions. The idea is to do string clustering: I have properties values
> >>> expressed as strings of some resources, and I would like to aggregate
> >>> these resources. I use Eclipse as IDE, and I have two Mahout working
> >>> projects, one with release version (0.1) and the other one with SVN
> >>> version. I am able to compile examples and to run them on my own Hadoop
> >>> cluster. I have focused on Synthetic Control Data example using Canopy
> >>> algorithm because of its similarity to my problem.
> >>>
> >>> - on release version with default parameter values I get all the items on
> >>> the same cluster (C1), is it normal?
> >>>
> >> Are you running the Synthetic Control example data here? That example -
> >> I just ran it on trunk - should produce 6 clusters in one file. It is
> >> binary encoded though, and difficult to interpret in textual
> >> representation. If you search for the string 'SparseVector' in the
> >> canopies/part-0000 file you should see six instances.
> >>
> >>> - on SVN version I don't have a readable output because there is no
> >>> implemented OutputDriver. If I use the same as release version, I got
> >>> exceptions (I think that format has changed between releases, for example
> >>> using '{' symbol instead of '[')
> >>>
> >> The output formats of all the clustering routines are now sequence files
> >> which are binary encoded. The old OutputDriver won't handle it.
> >>
> >>> - I use string values instead of double values. I have implemented my own
> >>> string distance that returns a double when parameters are string, but I
> >>> think that Mahout Vectors are implemented just to store double values. Is
> >>> there any chance to use string values?
> >>>
> >> Vectors are double only and you will need to massage your data into
> >> numeric format to use out of the box clustering. Is there a way to
> >> convert your property values into doubles?
> >>
> >>> I would be very grateful if anyone could help me.
> >>>
> >> I'm going to be working on converting clustering to Hadoop 0.20 in the
> >> next weeks. Let's continue our dialog.
> >>
> >>> Thank you very much in advance.
> >>>
> >>> Regards,
> >>> jfcg
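
To make the string-distance idea in this thread a bit more concrete, here is a minimal sketch in plain Java that does not use Mahout at all. The class name StringCanopySketch, its method names, and the thresholds t1 and t2 are hypothetical and chosen only for illustration. It computes Levenshtein distance directly on strings and then groups them with a simple in-memory canopy pass. This is an assumption about what a hand-written Canopy over strings might look like, not Mahout's implementation, and it ignores the Hadoop side entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Mahout. */
final class StringCanopySketch {

    /** Classic dynamic-programming Levenshtein edit distance, two rows at a time. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the latest one
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Simple canopy pass, assuming t1 > t2: take the first remaining item as a
     * center, add every item closer than t1 to its canopy, and drop every item
     * closer than t2 from the pool of future centers.
     */
    static List<List<String>> canopies(List<String> items, double t1, double t2) {
        List<String> remaining = new ArrayList<>(items);
        List<List<String>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            String center = remaining.remove(0);
            List<String> canopy = new ArrayList<>();
            canopy.add(center);
            List<String> stillAvailable = new ArrayList<>();
            for (String s : remaining) {
                double d = levenshtein(center, s);
                if (d < t1) {
                    canopy.add(s); // loosely close: member of this canopy
                }
                if (d >= t2) {
                    stillAvailable.add(s); // not tightly bound: may seed or join others
                }
            }
            remaining = stillAvailable;
            result.add(canopy);
        }
        return result;
    }
}
```

Calling StringCanopySketch.canopies(values, t1, t2) on the five example values would return one list of strings per canopy. The thresholds would have to be tuned against whatever distance is really used, and an item can appear in more than one canopy, which is normal for this algorithm.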
