Hi Andrew,
I think you are looking for the shingle package in contrib/analyzers.
karl
6 okt 2009 kl. 13.42 skrev Andrew Zhang:
Hi guys,
The requirement is very simple here, e.g. for this sentence, 'The NBA
formally announced its new *social media* guidelines Wednesday', I
want to
treat '*social media*' as a whole phase term. The default english
analyzers
came with lucene all deal with single word, so it you want to get
the most
frequent terms, *social *and *media* are separated, and each of them
can't
represent a good meaning as *social media*, right?
I know there's a way built on some phase dictionary, and try to
match the
phase already there, very like the way to do with chinese language,
but is
there an open source solution for english, I mean I don't want to
build a
phase dictionary myself, and I also want a smart way, which can
"discover"
the phase automatically. I got 2 millions docs analyzered the norma
way, all
single terms, which I can use as a base source, and it's possible to
find
that *social media *came together frequently, but I really don't
know what's
the reverse way.
I tried to find some phase analyzers, but no luck. so any advices?
Regards,
Andrew
--
Simple is best
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org