Hi Jelena,

Got it! What you are describing makes perfect sense, and seems like it suits Text::NSP pretty well. Clustering is one of those terms that means many different things to different people, so that's why I asked for a little more detail.

I very much like your idea of using the --token option to specify the clusters you are interested in finding. In fact, I've often felt the --token option is very underutilized, and this is the sort of more extended use I'm always hoping to see.

Given your example below, it should be pretty easy to put together a token regex file to find other occurrences of your clusters. You could do something pretty literal:

  /\bsales for watches\b/
  /\w+/

or something more flexible:

  /\b(sales|discounts) (for|on) (watches|clocks)\b/
  /\w+/

So if you did this with --ngram 3 in count.pl, you could find all trigrams where your "sales for watches" cluster/string is treated as just one of the three components of the ngram. Just remember that in your --token option you specify what a single token or term should be, and those tokens are then combined to create ngrams of whatever size you specify. So the above doesn't guarantee that all the ngrams would have "sales for watches" in them, only that any text encountered that did include it would be "counted" as one part of an ngram.

SenseClusters uses NSP and would allow for the same kind of tokenization strategies. So if you wanted to find these kinds of clusters based on NSP output and then use those as input to a more complex clustering process, you could certainly do something like this with SenseClusters. That said, it sounds to me like NSP might actually do quite a bit of what you want, assuming you do some interesting things with your regular expressions for the --token option.

I hope this makes a little sense. Please let me know if there's something about this that doesn't make sense, or doesn't work as you hope.

Good luck!
Ted

On Fri, Jan 14, 2011 at 12:30 PM, jelena_isacenkova <jelena.i...@gmail.com> wrote:

> Thanks for a quick reply, Ted. Here is the kind of data I have as an example:
> 20% sales for watches, Ann
> 50% sales for watches, Peter
> 70% sales for watches, Tom
> ... etc.
>
> Using ngram I am able to find a 'stem' of the phrase, which is 'sales<>for<>watches<>,' and has a length of 4-gram. This is what I call a cluster. Additionally I add 2 extra parameters: date & size of the email. Later I use them to evaluate the cluster.
>
> Next, I take a line similar to the example, and I want to recognize the cluster to which it belongs, or to know that it belongs to none in my list, only by text.
>
> I tried the sense cluster some time ago on my data; to be honest I was lost in its options and the results I got were not good. You think there might be an option for such a task?
>
> Cheers,
> Jelena
>
> --- In ngram@yahoogroups.com, Ted Pedersen <tpederse@...> wrote:
> >
> > Hi Jelena,
> >
> > Could you describe in a little more detail how you are using Text::NSP as a part of the clustering work you describe?
> >
> > http://ngram.sourceforge.net
> >
> > Text::NSP doesn't have native support for clustering, although it's easy to imagine using it as a part of a larger clustering process (and in fact that's what we do in our SenseClusters package http://senseclusters.sourceforge.net ).
> >
> > In any case, if you can describe where Text::NSP fits into this picture I'm sure we'll be able to make some suggestions.
> >
> > Cordially,
> > Ted
> >
> > On Fri, Jan 14, 2011 at 9:51 AM, jelena_isacenkova <jelena.info@...> wrote:
> > >
> > > Hi guys,
> > >
> > > I am currently using the module to cluster the text lines (phrases) in n-grams and it's working fine. I would like to also cluster the new incoming text records based on the clustered data. I am sure it is a common thing to do, however, it is not described in the package.
> > >
> > > I have an idea of using a token option for describing a list of already existing clusters in order to check if one matches any of the existing clusters.
> > > Would be very interested to know your opinion and comments.
> > >
> > > Thanks in advance,
> > > Jelena
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
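
P.S. For anyone who wants to experiment with the --token idea from the reply above before touching count.pl, here is a minimal Python sketch that only emulates the behaviour described there: token regexes are tried in order, each match becomes one token, and tokens are combined into ngrams. It is not Text::NSP itself. The names `tokenize`, `count_ngrams` and `find_cluster` are hypothetical helpers invented for this illustration, and counting ngrams within a single line (never across lines) is a simplifying assumption.

```python
import re
from collections import Counter

# Token definitions, tried in order. These mirror the "more flexible"
# token file from the reply; non-capturing groups are used so the
# patterns can be joined safely.
TOKEN_PATTERNS = [
    r"\b(?:sales|discounts) (?:for|on) (?:watches|clocks)\b",  # cluster token
    r"\w+",                                                    # ordinary word
]
CLUSTER = re.compile(TOKEN_PATTERNS[0])
TOKENIZER = re.compile("|".join(f"(?:{p})" for p in TOKEN_PATTERNS))


def tokenize(line):
    """Split a line into tokens; text matching no pattern is skipped."""
    return [m.group(0) for m in TOKENIZER.finditer(line)]


def count_ngrams(lines, n=3):
    """Count ngrams of n tokens per line (assumption: no ngram spans lines)."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def find_cluster(line):
    """Return the first token that is a known cluster, or None."""
    for token in tokenize(line):
        if CLUSTER.fullmatch(token):
            return token
    return None
```

With this sketch, `find_cluster` applied to a new incoming line gives either the matching cluster string or `None`, which is roughly the "does it belong to any cluster in my list" check Jelena describes. As in the reply, the ngram counts are not restricted to ngrams containing a cluster; the cluster is simply counted as one token wherever it occurs.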