Re: Custom Tokenizer/Analyzer

Benson Margulies Thu, 20 Feb 2014 04:22:13 -0800

It sounds like you've been asked to implement Named Entity Recognition.
OpenNLP has some capability here. There are also, um, commercial
alternatives.



On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio <ye.pe...@gmail.com> wrote:

> On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar <geetgang...@gmail.com>
> wrote:
>
> Hi,
>
> > My requirement is that it should be able to match multiple words as
> > one token. For example, when the user passes a string such as
> > "International Business Machine logo" or "IBM logo", it should return
> > "International Business Machine" as one token and "logo" as one token.
>
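
To make the requirement concrete, here is a minimal sketch of greedy
longest-match phrase merging in plain Java. It is not a Lucene Tokenizer;
the class name PhraseMerger and its methods are made up for illustration,
and it assumes whitespace-separated input and a phrase list supplied by
the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; this is not a Lucene class.
class PhraseMerger {
    // Known phrases, lower-cased and joined with single spaces.
    private final Set<String> phrases = new HashSet<>();
    // Length in words of the longest known phrase.
    private int maxWords = 1;

    void addPhrase(String phrase) {
        String[] words = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        phrases.add(String.join(" ", words));
        maxWords = Math.max(maxWords, words.length);
    }

    // Splits on whitespace, then merges the longest known phrase found
    // at each position into a single token (case-insensitive match).
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return tokens;
        }
        List<String> words = Arrays.asList(text.trim().split("\\s+"));
        int i = 0;
        while (i < words.size()) {
            int matched = 1; // fall back to a single-word token
            for (int n = Math.min(maxWords, words.size() - i); n > 1; n--) {
                String candidate = String.join(" ", words.subList(i, i + n));
                if (phrases.contains(candidate.toLowerCase(Locale.ROOT))) {
                    matched = n;
                    break;
                }
            }
            tokens.add(String.join(" ", words.subList(i, i + matched)));
            i += matched;
        }
        return tokens;
    }
}
```

With "International Business Machine" registered as a phrase, the input
"International Business machine logo" comes back as two tokens. "IBM"
stays a plain single-word token here; relating it to the long form needs
a dictionary.
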
> This is an interesting problem. I suppose that if the user enters
> "International Business Machines", possibly with some misspelling, you
> want to find all documents containing "IBM" - and that if he enters
> the string "IBM", you want to find documents which contain the string
> "International Business Machines", or even only parts of it. So this
> means you need some kind of map relating some acronyms with their
> content parts. There really are two directions here: acronym to
> content and content to acronym.
>
> One cannot find what an acronym means without some kind of acronym
> dictionary. This means that whatever approach you intend to use, there
> should be an external dictionary involved, which maps each acronym to
> a list of possible phrases. After retrieving all phrases matching
> the input acronym, you'd inject each part of each phrase as a token
> (removing possible duplicates between phrase parts). That's basically
> it for the direction "acronym to content".
>
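
The "acronym to content" direction could be sketched roughly like this
in plain Java. The class AcronymExpander is hypothetical (not part of
Lucene), and the dictionary contents are assumed to come from whatever
external source you have.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; this is not a Lucene class.
class AcronymExpander {
    // Upper-cased acronym -> all phrases it may stand for.
    private final Map<String, List<String>> phrasesByAcronym;

    AcronymExpander(Map<String, List<String>> phrasesByAcronym) {
        this.phrasesByAcronym = phrasesByAcronym;
    }

    // Returns the distinct lower-cased words of every phrase listed for
    // the acronym, in first-seen order. Unknown tokens come back unchanged.
    List<String> expand(String token) {
        List<String> phrases = phrasesByAcronym.get(token.toUpperCase(Locale.ROOT));
        if (phrases == null) {
            return Collections.singletonList(token);
        }
        Set<String> parts = new LinkedHashSet<>(); // drops duplicate parts
        for (String phrase : phrases) {
            for (String word : phrase.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!word.isEmpty()) {
                    parts.add(word);
                }
            }
        }
        return new ArrayList<>(parts);
    }
}
```

The LinkedHashSet is what removes duplicates between phrase parts while
keeping a stable order. In a real analysis chain the returned words
would be injected as extra tokens rather than returned as a list.
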
> The direction "content to acronym" is trickier, I believe. One way is
> to generate a second (reversed) map, matching each acronym content
> part to a list of acronyms containing that part. You'd simply inject
> acronyms (and possibly other things) if one part of their content is
> matched (or more than one part, if you want to increase relevance).
> This could however possibly require the definition of a specific
> hashing mechanism, if you want to find approximate (distanced) keys
> (e.g. "intenational", with the missing "r", would still find "IBM"). A
> second way (more coupled to the concept of acronym, so less generic)
> could be to consider that every word starting with a capital letter is
> part of an acronym, buffering sequences of words starting with a
> capital letter, and eventually injecting the resulting acronym, if
> found in the acronym dictionary. This might not be safe, though - the
> user may not have the discipline to capitalize the words being part of
> an acronym (or may even misspell the first letter), or concatenated
> first letters could match an irrelevant acronym (many word sequences
> can give the acronym "IBM").
>
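
Both "content to acronym" ideas can be sketched in plain Java as below.
Everything here is hypothetical (the class AcronymReverseIndex is not a
Lucene class). For approximate keys I have swapped in a linear scan with
Levenshtein edit distance instead of a specific hashing mechanism, which
is simpler but slower on a large dictionary. The capitalized-run method
only checks the initials of a whole run, so it inherits the safety
problems described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; this is not a Lucene class.
class AcronymReverseIndex {
    private final Set<String> knownAcronyms = new TreeSet<>();
    // Lower-cased phrase word -> acronyms whose expansion contains it.
    private final Map<String, Set<String>> acronymsByPart = new HashMap<>();

    void add(String acronym, String phrase) {
        knownAcronyms.add(acronym);
        for (String word : phrase.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) continue;
            Set<String> set = acronymsByPart.get(word);
            if (set == null) {
                set = new TreeSet<>();
                acronymsByPart.put(word, set);
            }
            set.add(acronym);
        }
    }

    // First way: acronyms having a part within maxEdits of the given word.
    Set<String> lookup(String word, int maxEdits) {
        String query = word.toLowerCase(Locale.ROOT);
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : acronymsByPart.entrySet()) {
            if (editDistance(query, e.getKey()) <= maxEdits) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }

    // Second way: buffer runs of capitalized words and emit the run's
    // initials when they form a known acronym.
    List<String> acronymsFromCapitalizedRuns(String text) {
        List<String> found = new ArrayList<>();
        StringBuilder initials = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                initials.append(word.charAt(0));
            } else {
                flush(initials, found);
            }
        }
        flush(initials, found);
        return found;
    }

    private void flush(StringBuilder initials, List<String> found) {
        if (initials.length() > 1 && knownAcronyms.contains(initials.toString())) {
            found.add(initials.toString());
        }
        initials.setLength(0);
    }

    // Classic two-row Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

After add("IBM", "International Business Machines"), a lookup of
"intenational" with maxEdits = 1 finds "IBM", because the misspelling is
one insertion away from "international".
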
> I do not know whether there already exists some Lucene module which
> processes acronyms, or if someone is working on one. It's definitely
> worth a search though, because writing a good one from scratch could
> mean a few days of work, or more.
>
> HTH.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
