Hi, Take the NLP route and use modules like POS tagger and NP chunker.
OpenNLP has a stack for English language. Try to use them. Regards Vasu On Tue, Oct 6, 2009 at 5:12 PM, Andrew Zhang <rooseve6...@gmail.com> wrote: > Hi guys, > > The requirement is very simple here, e.g. for this sentence, 'The NBA > formally announced its new *social media* guidelines Wednesday', I want to > treat '*social media*' as a whole phase term. The default english analyzers > came with lucene all deal with single word, so it you want to get the most > frequent terms, *social *and *media* are separated, and each of them can't > represent a good meaning as *social media*, right? > > I know there's a way built on some phase dictionary, and try to match the > phase already there, very like the way to do with chinese language, but is > there an open source solution for english, I mean I don't want to build a > phase dictionary myself, and I also want a smart way, which can "discover" > the phase automatically. I got 2 millions docs analyzered the norma way, > all > single terms, which I can use as a base source, and it's possible to find > that *social media *came together frequently, but I really don't know > what's > the reverse way. > > I tried to find some phase analyzers, but no luck. so any advices? > > Regards, > Andrew > -- > Simple is best >