Hi Andrew,

I think you are looking for the shingle package in contrib/analyzers.
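
For context, a "shingle" is a token made of adjacent words, so a two-word shingle over the example sentence would produce "social media" as a single term. The following is only a rough plain-Java sketch of that idea, not the contrib package itself (which provides this as a token filter, ShingleFilter). The class name ShingleSketch, its methods, and the simple regex tokenization are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: does not use Lucene.
public class ShingleSketch {

    // Lowercase the text, split on non-alphanumeric characters,
    // and return every pair of adjacent words as one "bigram" term.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                tokens.add(w);
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // Count how often each bigram occurs across a set of documents.
    static Map<String, Integer> countBigrams(List<String> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String bigram : bigrams(doc)) {
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The NBA formally announced its new social media guidelines Wednesday",
            "New social media rules were published");
        Map<String, Integer> counts = countBigrams(docs);
        // Print only bigrams seen more than once, as candidate phrases.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                System.out.println(e.getKey() + " = " + e.getValue());
            }
        }
    }
}
```

Indexing such shingles alongside the single terms would let frequent pairs show up in term statistics. Raw counts will also surface pairs like "of the", so some filtering (stop words, or a threshold relative to the single-term frequencies) would probably be needed.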


      karl

On 6 Oct 2009, at 13:42, Andrew Zhang wrote:

Hi guys,

The requirement is very simple. For example, in the sentence 'The NBA formally announced its new *social media* guidelines Wednesday', I want to treat '*social media*' as a single phrase term. The default English analyzers that come with Lucene all deal with single words, so if you want to get the most frequent terms, *social* and *media* are separated, and neither of them conveys the meaning of *social media*, right?

I know there is an approach based on a phrase dictionary that tries to match phrases already in it, much like what is done for Chinese, but is there an open source solution for English? I don't want to build a phrase dictionary myself, and I would also like a smart approach that can "discover" phrases automatically. I have 2 million docs analyzed the normal way, all single terms, which I can use as a base source, and it should be possible to find that *social media* occurs together frequently, but I really don't know how to go the other way around.

I tried to find some phrase analyzers, but had no luck. Any advice?

Regards,
Andrew
--
Simple is best


