RE: String clustering and other newbie questions

Juan Francisco Contreras Gaitan Mon, 31 Aug 2009 05:02:44 -0700

Hello Ted,

Thank you very much for your answer, but I'm afraid I don't fully understand
it. Could you give me some more details? For example, what does 'DP' stand
for? You can see an example of what I would like to do in my previous reply.


I'm sorry for all these questions, but I'm just getting started in this field.

Thank you very much for your time.

Regards,
jfcg

> From: [email protected]
> Date: Fri, 28 Aug 2009 11:15:22 -0700
> Subject: Re: String clustering and other newbie questions
> To: [email protected]
> 
> To cluster strings, you need to have a distance between "centroids" and
> strings.  The DP clustering stuff could handle this, but not the rest of the
> clustering.  The way that it would work in DP would be that there would be
> parametrized models that describe probabilities of generating strings
> instead of just being multi-dimensional points.  The similarity of a string
> to a model is interpreted as the probability of the string given the model.
> 
> On Fri, Aug 28, 2009 at 11:09 AM, Jeff Eastman
> <[email protected]> wrote:
> 
> > Well, all of the clustering code is based upon clustering points in an
> > n-dimensional vector space and all of the APIs operate upon Vectors. We do
> > support the ability to attach a label binding Map to a Vector which can map
> > Strings into integer index values. Once this has been done you can access
> > the vector values symbolically. I'm not sure this will help with your
> > problem and you may need to write your own Canopy.
> >
> > If you can post some examples of the values you wish to cluster and
> > something of your distance measure then I will see if I can figure out a way
> > to help you further.
> >
> > Jeff
> >
> >
> >
> > Juan Francisco Contreras Gaitan wrote:
> >
> >> Thank you so much for your quick reply.
> >>
> >> Unfortunately, I'm afraid that there is no way of massaging my strings
> >> into doubles, because the distance measure would make no sense in terms
> >> of doubles. Could you please give me some pointers on how to write the
> >> code required to get around this difficulty?
> >>
> >> Thank you very much again.
> >>
> >> Regards,
> >> jfcg
> >>
> >>
> >>
> >>> Date: Fri, 28 Aug 2009 08:49:38 -0700
> >>> From: [email protected]
> >>> To: [email protected]
> >>> Subject: Re: String clustering and other newbie questions
> >>>
> >>> Juan Francisco Contreras Gaitan wrote:
> >>>
> >>>
> >>>> Hello,
> >>>>
> >>>> I would like to do some clustering using Hadoop, and I found Mahout. I
> >>>> am really impressed, but as a newbie I got stuck and I have several
> >>>> questions. The idea is to do string clustering: I have property values,
> >>>> expressed as strings, for some resources, and I would like to aggregate
> >>>> these resources. I use Eclipse as my IDE, and I have two working Mahout
> >>>> projects, one with the release version (0.1) and the other with the SVN
> >>>> version. I am able to compile the examples and run them on my own Hadoop
> >>>> cluster. I have focused on the Synthetic Control Data example using the
> >>>> Canopy algorithm because of its similarity to my problem.
> >>>>
> >>>> - On the release version with default parameter values, I get all the
> >>>> items in the same cluster (C1). Is that normal?
> >>>>
> >>>>
> >>> Are you running the Synthetic Control example data here? That example - I
> >>> just ran it on trunk - should produce 6 clusters in one file. It is binary
> >>> encoded though, and difficult to interpret in textual representation. If
> >>> you search for the string 'SparseVector' in the canopies/part-0000 file
> >>> you should see six instances.
> >>>
> >>>
> >>>> - On the SVN version I don't get readable output because no OutputDriver
> >>>> is implemented. If I use the one from the release version, I get
> >>>> exceptions (I think the format has changed between releases, for example
> >>>> using the '{' symbol instead of '[').
> >>>>
> >>>>
> >>> The output formats of all the clustering routines are now sequence files
> >>> which are binary encoded. The old OutputDriver won't handle it.
> >>>
> >>>
> >>>> - I use string values instead of double values. I have implemented my
> >>>> own string distance that returns a double when the parameters are
> >>>> strings, but I think that Mahout Vectors are implemented to store only
> >>>> double values. Is there any way to use string values?
> >>>>
> >>>>
> >>> Vectors are double only and you will need to massage your data into
> >>> numeric format to use out of the box clustering. Is there a way to convert
> >>> your property values into doubles?
> >>>
> >>>
> >>>> I would be very grateful if anyone could help me.
> >>>>
> >>>>
> >>> I'm going to be working on converting clustering to Hadoop 0.20 in the
> >>> next few weeks. Let's continue our dialog.
> >>>
> >>>
> >>>> Thank you very much in advance.
> >>>>
> >>>> Regards,
> >>>> jfcg
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
> >
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
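
On the question above about 'DP': in Mahout's clustering code it refers to
Dirichlet process clustering. Ted's point, that a cluster can be a model that
assigns a probability to each string rather than a point in a vector space,
can be illustrated with a minimal Python sketch. It is not Mahout code and not
a Dirichlet process implementation: the CharModel class, the ALPHABET constant,
and the assign helper are hypothetical names, and the smoothed unigram
character model is just one simple, assumed choice of string model.

```python
import math
from collections import Counter

# Assumed alphabet for this sketch; characters outside it are ignored.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


class CharModel:
    """Hypothetical cluster model: unigram character frequencies
    with add-one smoothing, estimated from example strings."""

    def __init__(self, examples):
        self.counts = Counter(
            ch for s in examples for ch in s if ch in ALPHABET
        )
        self.total = sum(self.counts.values())

    def log_prob(self, s):
        # log P(s | model): sum of smoothed per-character log-probabilities.
        denom = self.total + len(ALPHABET)
        return sum(
            math.log((self.counts[ch] + 1) / denom)
            for ch in s
            if ch in ALPHABET
        )


def assign(s, models):
    """Return the index of the model under which s is most probable."""
    return max(range(len(models)), key=lambda i: models[i].log_prob(s))


if __name__ == "__main__":
    models = [CharModel(["aaa", "aab"]), CharModel(["123", "321"])]
    for s in ["aba", "213"]:
        print(s, "-> model", assign(s, models))
```

Here the "similarity" of a string to a cluster is its log-probability under
that cluster's model, so no string ever has to be converted into a vector of
doubles. A full Dirichlet process clusterer would also re-estimate the models
and decide how many of them to keep, which this sketch does not attempt.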

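Jeff's suggestion of writing a custom Canopy for strings can also be sketched
outside Mahout. The following Python example is an assumption-laden
illustration, not Mahout's implementation: it uses Levenshtein edit distance as
a stand-in for the custom string distance mentioned in the thread, and the
function names and the thresholds t1 and t2 are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def canopy(strings, distance, t1, t2):
    """Canopy clustering over arbitrary items, given a distance function.

    t1 is the loose threshold (canopy membership) and t2 the tight one
    (items this close to a center can no longer start a new canopy).
    """
    if t1 < t2:
        raise ValueError("t1 must be at least t2")
    remaining = list(strings)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for s in remaining:
            d = distance(center, s)
            if d < t1:
                members.append(s)
            if d >= t2:
                still_remaining.append(s)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = ["kitten", "sitten", "mitten", "hadoop", "hadoob"]
    for center, members in canopy(data, levenshtein, t1=3, t2=2):
        print(center, members)
```

Because the algorithm only ever calls distance(center, s), the items never
need to be numeric; any string distance that returns a number can be plugged
in. Note that canopy centers here are always actual input strings, which
sidesteps the problem of computing a "centroid" of several strings.
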