Re: [ngram] Re: Clustering with ngram

Ted Pedersen Fri, 14 Jan 2011 14:10:53 -0800

Hi Jelena,

Got it! What you are describing makes perfect sense, and seems like it suits
Text::NSP pretty well. Clustering is one of those terms that means many
different things to different people, so that's why I asked for a little
more detail.

I very much like your idea of using the --token option to specify clusters
you are interested in finding. In fact I've often felt the --token option is
very underutilized, and this is the sort of more extended use I'm always
hoping to see. Given your example below, it should be pretty easy to put
together a token regex file to find other occurrences for your clusters...

You could do something pretty literal...

/\bsales for watches\b/
/\w+/

or something more flexible...

/\b(sales|discounts) (for|on) (watches|clocks)\b/
/\w+/

So if you did this with --ngram 3 in count.pl, you could find all trigrams
where your "sales for watches" cluster/string is treated as just one of the
three components of the ngram. Just remember that in your --token option,
you have the ability to specify what a single token or term should be, and
then those are combined together to create ngrams of whatever size you
specify. So the above doesn't guarantee that all the ngrams would have
"sales for watches" in them, but that any text encountered that did include
that would be "counted" as a part of an ngram.
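
To make the tokenization idea concrete, here is a rough Python sketch (not NSP code) that mimics the behaviour described above. It assumes the token regexes are tried in file order at the current position, and that a character is skipped when none of them matches there. The function names and the "<>" output format are just for illustration.

```python
import re
from collections import Counter

# Token definitions in file order, mirroring the more flexible regex file above.
TOKEN_PATTERNS = [
    re.compile(r"\b(sales|discounts) (for|on) (watches|clocks)\b"),
    re.compile(r"\w+"),
]

def tokenize(text):
    """Scan left to right; at each position use the first pattern that matches there."""
    tokens, pos = [], 0
    while pos < len(text):
        for pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                tokens.append(m.group(0))
                pos = m.end()
                break
        else:
            pos += 1  # assumption: an unmatched character is a non-token and is skipped
    return tokens

def ngrams(tokens, n):
    """Count every run of n consecutive tokens, joined with '<>'."""
    return Counter(
        "<>".join(tokens[i:i + n]) + "<>" for i in range(len(tokens) - n + 1)
    )

tokens = tokenize("20% sales for watches, Ann")
print(tokens)             # ['20', 'sales for watches', 'Ann']
print(ngrams(tokens, 3))  # one trigram, with the whole cluster as its middle token
```

Note that a line without the phrase simply falls through to /\w+/ and yields ordinary single-word tokens, which is why not every ngram will contain the cluster.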

SenseClusters uses NSP and would allow for the same kind of tokenization
strategies. So if you wanted to find these kinds of clusters based on NSP
output and then use those as input to a more complex clustering process, you
could certainly do that with SenseClusters. That said, it sounds to me like
NSP might actually do quite a bit of what you want, assuming you do some
interesting things with your regular expressions for the --token option.
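
If the end goal is just to decide which known cluster (if any) a new line belongs to, the same regexes can be reused outside NSP. Here is a minimal Python sketch of that lookup. The cluster names and the second pattern are hypothetical, made up only for the example.

```python
import re

# Hypothetical cluster list: cluster name -> regex describing its phrase.
CLUSTERS = {
    "sales-watches": re.compile(r"\b(sales|discounts) (for|on) (watches|clocks)\b"),
    "free-shipping": re.compile(r"\bfree shipping\b"),  # made-up second cluster
}

def find_cluster(line):
    """Return the name of the first cluster whose pattern occurs in the line, else None."""
    for name, pattern in CLUSTERS.items():
        if pattern.search(line):
            return name
    return None

print(find_cluster("70% sales for watches, Tom"))  # sales-watches
print(find_cluster("meeting moved to Monday"))     # None
```

The first matching cluster wins here, so the order of the list matters if patterns can overlap.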

I hope this makes a little sense. Please let me know if there's something
about this that doesn't make sense, or doesn't work as you hope.

Good luck!
Ted

On Fri, Jan 14, 2011 at 12:30 PM, jelena_isacenkova
<jelena.i...@gmail.com> wrote:

>
>
> Thanks for a quick reply, Ted. Here is the kind of data I have as an
> example:
> 20% sales for watches, Ann
> 50% sales for watches, Peter
> 70% sales for watches, Tom
> ... etc.
>
> Using ngram I am able to find a 'stem' of the phrase, which is
> 'sales<>for<>watches<>,' and is a 4-gram. This is what I call a
> cluster. Additionally I add 2 extra parameters: date & size of the email.
> Later I use them to evaluate the cluster.
> Next, I take a line similar to the example, and I want to recognize the
> cluster to which it belongs, or to know that it belongs to none in my list,
> based only on the text.
>
> I tried SenseClusters some time ago on my data; to be honest I was lost
> in its options and the results I got were not good. Do you think there might
> be an option for such a task?
>
> Cheers,
> Jelena
>
>
> --- In ngram@yahoogroups.com, Ted Pedersen
> <tpederse@...> wrote:
> >
> > Hi Jelena,
> >
> > Could you describe in a little more detail how you are using Text::NSP as a
> > part of the clustering work you describe?
> >
> > http://ngram.sourceforge.net
> >
> > Text::NSP doesn't have native support for clustering, although it's easy to
> > imagine using it as a part of a larger clustering process (and in fact
> > that's what we do in our SenseClusters package
> > http://senseclusters.sourceforge.net ).
> >
> > In any case, if you can describe where Text::NSP fits into this picture I'm
> > sure we'll be able to make some suggestions.
> >
> > Cordially,
> > Ted
> >
> > On Fri, Jan 14, 2011 at 9:51 AM, jelena_isacenkova <jelena.info@...> wrote:
>
> >
> > >
> > >
> > > Hi guys,
> > >
> > > I am currently using the module to cluster the text lines (phrases) in
> > > n-grams and it's working fine. I would like to also cluster the new incoming
> > > text records based on the clustered data. I am sure it is a common thing to
> > > do, however, it is not described in the package.
> > >
> > > I have an idea of using the token option to describe a list of already
> > > existing clusters, in order to check whether a record matches any of
> > > them. I would be very interested to know your opinion and comments.
> > >
> > > Thanks in advance,
> > > Jelena
> > >
> > >
> > >
> >
> >
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse
> >
>
>  
>

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
