Hello Ted,

Thank you very much for your answer, but I'm afraid I don't fully understand it. Could you give me some more details? For example, what does 'DP' stand for? You can see an example of what I would like to do in my previous reply.
I'm so sorry for these questions, but I'm starting in this field.

Thank you very much for your time.

Regards,
jfcg

> From: [email protected]
> Date: Fri, 28 Aug 2009 11:15:22 -0700
> Subject: Re: String clustering and other newbie questions
> To: [email protected]
>
> To cluster strings, you need to have a distance between "centroids" and
> strings. The DP clustering stuff could handle this, but not the rest of the
> clustering. The way that it would work in DP would be that there would be
> parametrized models that describe probabilities of generating strings
> instead of just being multi-dimensional points. The similarity of a string
> to a model is interpreted as the probability of the string given the model.
>
> On Fri, Aug 28, 2009 at 11:09 AM, Jeff Eastman
> <[email protected]> wrote:
>
> > Well, all of the clustering code is based upon clustering points in an
> > n-dimensional vector space and all of the APIs operate upon Vectors. We do
> > support the ability to attach a label binding Map to a Vector which can map
> > Strings into integer index values. Once this has been done you can access
> > the vector values symbolically. I'm not sure this will help with your
> > problem and you may need to write your own Canopy.
> >
> > If you can post some examples of the values you wish to cluster and
> > something of your distance measure then I will see if I can figure out a way
> > to help you further.
> >
> > Jeff
> >
> > Juan Francisco Contreras Gaitan wrote:
> >
> >> Thank you so much for your quick reply.
> >>
> >> Unfortunately, I'm afraid that there is no way of massaging my strings
> >> into doubles, because the distance measure would have no sense in terms of
> >> doubles. Could you please give me some clue to write the required code in
> >> order to solve this difficulty?
> >>
> >> Thank you very much again.
> >>
> >> Regards,
> >> jfcg
> >>
> >>> Date: Fri, 28 Aug 2009 08:49:38 -0700
> >>> From: [email protected]
> >>> To: [email protected]
> >>> Subject: Re: String clustering and other newbie questions
> >>>
> >>> Juan Francisco Contreras Gaitan wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I would like to do some clustering by using Hadoop and I found Mahout. I
> >>>> am really impressed, but as a newbie I got stuck and I have several
> >>>> questions. The idea is to do string clustering: I have properties values
> >>>> expressed as strings of some resources, and I would like to aggregate
> >>>> these resources. I use Eclipse as IDE, and I have two Mahout working
> >>>> projects, one with release version (0.1) and the other one with SVN
> >>>> version. I am able to compile examples and to run them on my own Hadoop
> >>>> cluster. I have focused on Synthetic Control Data example using Canopy
> >>>> algorithm because of its similarity to my problem.
> >>>>
> >>>> - on release version with default parameter values I get all the items
> >>>> on the same cluster (C1), is it normal?
> >>>>
> >>> Are you running the Synthetic Control example data here? That example - I
> >>> just ran it on trunk - should produce 6 clusters in one file. It is binary
> >>> encoded though, and difficult to interpret in textual representation. If
> >>> you search for the string 'SparseVector' in the canopies/part-0000 file
> >>> you should see six instances.
> >>>
> >>>> - on SVN version I don't have a readable output because there is no
> >>>> implemented OutputDriver. If I use the same as release version, I got
> >>>> exceptions (I think that format has changed between releases, for example
> >>>> using '{' symbol instead of '[')
> >>>>
> >>> The output formats of all the clustering routines are now sequence files
> >>> which are binary encoded. The old OutputDriver won't handle it.
> >>>
> >>>> - I use string values instead of double values. I have implemented my
> >>>> own string distance that returns a double when parameters are string,
> >>>> but I think that Mahout Vectors are implemented just to store double
> >>>> values. Is there any chance to use string values?
> >>>>
> >>> Vectors are double only and you will need to massage your data into
> >>> numeric format to use out of the box clustering. Is there a way to convert
> >>> your property values into doubles?
> >>>
> >>>> I would be very grateful if anyone could help me.
> >>>>
> >>> I'm going to be working on converting clustering to Hadoop 0.20 in the
> >>> next weeks. Let's continue our dialog.
> >>>
> >>>> Thank you very much in advance.
> >>>>
> >>>> Regards,
> >>>> jfcg
> >>>>
>
> --
> Ted Dunning, CTO
> DeepDyve
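
The point in Ted's quoted reply, that each cluster becomes a parametrized model and the similarity of a string to a cluster is the probability of the string given that model, can be illustrated with a small sketch. This is not Mahout code and does not use any Mahout API: it is plain Python, and the CharModel class, the assign function, the smoothing constant alpha, and the toy strings are all hypothetical names and values chosen for the example. A smoothed per-character distribution is assumed here only because it is about the simplest model that can generate strings; a real application would probably need something richer.

```python
import math
from collections import Counter


class CharModel:
    """Hypothetical cluster model: a smoothed distribution over characters.

    It plays the role that a centroid plays for numeric points.
    """

    def __init__(self, members, alphabet, alpha=1.0):
        # Count how often each character occurs in the member strings.
        counts = Counter("".join(members))
        # Additive smoothing (alpha) keeps unseen characters from getting
        # probability zero.
        total = sum(counts[c] for c in alphabet) + alpha * len(alphabet)
        self.log_p = {
            c: math.log((counts[c] + alpha) / total) for c in alphabet
        }

    def log_prob(self, s):
        # Log-probability of generating s one character at a time.
        # Assumes every character of s is in the alphabet.
        return sum(self.log_p[c] for c in s)


def assign(s, models):
    """Return the index of the model under which s is most probable."""
    scores = [m.log_prob(s) for m in models]
    return scores.index(max(scores))


# Toy data, made up for illustration only.
alphabet = "abxyz"
models = [
    CharModel(["aaab", "abab", "aab"], alphabet),
    CharModel(["xyz", "xxyz", "zyx"], alphabet),
]

for s in ["abba", "zzxy"]:
    print(s, "-> model", assign(s, models))
```

Because the same string is scored against every model, the fact that longer strings always get lower probabilities does not affect which model wins. A full clustering loop would alternate between assigning strings to models this way and re-estimating each model from its assigned strings.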
