There are many uses for shingles.
I've used them to find common phrases in text, which, as I understand
it, is what you are trying to achieve. It works rather well, is a very
simple solution, and is easy on resources compared to real semantic
analysis.
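To make the idea concrete, here is a minimal sketch in plain Python, not the Lucene shingle package itself. It only counts two-word shingles across documents, and the names `tokenize`, `shingles` and `count_shingles` are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def shingles(tokens, size=2):
    # Every run of `size` consecutive tokens, as a tuple.
    return zip(*(tokens[i:] for i in range(size)))

def count_shingles(docs, size=2):
    # Count how often each shingle occurs over all documents.
    counts = Counter()
    for doc in docs:
        counts.update(shingles(tokenize(doc), size))
    return counts

docs = [
    "The NBA formally announced its new social media guidelines Wednesday",
    "Teams must follow the social media guidelines",
]
counts = count_shingles(docs)
print(counts.most_common(3))
```

On a real corpus the most frequent shingles are the candidate phrases; no dictionary is needed up front.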
You'll be getting a lot of shingles such as "there is" and "we are",
but using a stop word list to filter out any shingle containing one or
more of the stop words should do the trick (I did that in post
processing, keeping all shingles in my index). It will probably
require a bit of manual work, depending on your corpora, to get a really
clean list of common phrases that make sense. Just create a list,
inspect it by eye, and try to find patterns in the phrases you
want to get rid of. You might also want to look for punctuation in
your text to avoid creating shingles from text that spans different
sentences. There is a pretty good sentence extraction tool in GATE you
can use.
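The two clean-up steps can be sketched roughly like this, again in plain Python and under loud assumptions: the stop word list is a tiny made-up sample, the regex sentence split is a naive stand-in for a real sentence extraction tool, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "we", "there", "its", "of", "to"}

def sentences(text):
    # Naive split on ., ! and ?; a proper sentence splitter handles
    # abbreviations and similar cases far better.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def sentence_shingles(text, size=2):
    # Build shingles per sentence so none of them spans a sentence boundary.
    for sentence in sentences(text):
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        yield from zip(*(tokens[i:] for i in range(size)))

def drop_stop_word_shingles(counts, stop_words=STOP_WORDS):
    # Post-processing: discard any shingle containing a stop word,
    # leaving the original counts untouched.
    return Counter({s: n for s, n in counts.items()
                    if not any(tok in stop_words for tok in s)})

text = "There is a new policy. We are using social media. Social media is popular."
counts = Counter(sentence_shingles(text))
phrases = drop_stop_word_shingles(counts)
print(phrases.most_common())
```

Filtering after counting mirrors the post-processing approach: the full set of shingles stays available, and the stop word list can be adjusted without recounting.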
karl
On 7 Oct 2009, at 01:39, Andrew Zhang wrote:
Hi Karl,
I think shingles are designed to make phrase search faster. They will
generate a lot of phrase-like terms based on position only, completely
disregarding the meaning, and that's not good enough.
Regards,
Andrew
On Tue, Oct 6, 2009 at 11:51 PM, Karl Wettin <[email protected]> wrote:
Hi Andrew,
I think you are looking for the shingle package in contrib/analyzers.
karl
On 6 Oct 2009, at 13:42, Andrew Zhang wrote:
Hi guys,
The requirement is very simple here. For example, in the sentence 'The NBA
formally announced its new *social media* guidelines Wednesday', I want to
treat '*social media*' as a single phrase term. The default English analyzers
that come with Lucene all deal with single words, so if you want to get the
most frequent terms, *social* and *media* are separated, and neither of them
carries the meaning of *social media*, right?
I know there's an approach built on a phrase dictionary that tries to match
the phrases already in it, much like what is done for Chinese, but is there
an open source solution for English? I don't want to build a phrase
dictionary myself, and I also want a smart way to "discover" the phrases
automatically. I have 2 million docs analyzed the normal way, all single
terms, which I can use as a base source. It should be possible to find that
*social* and *media* occur together frequently, but I really don't know how
to go in that reverse direction.
I tried to find some phrase analyzers, but no luck. Any advice?
Regards,
Andrew
--
Simple is best
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]