It sounds like you've been asked to implement Named Entity Recognition. OpenNLP has a name finder component that covers some of this. There are also commercial alternatives.
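For the simplest version of the requirement quoted below (keep a known multi-word name together as one token), a plain dictionary lookup may already be enough. The following is only a rough sketch of that idea in Python, not OpenNLP's API and not real entity recognition: the function name `tokenize_with_phrases` and the `PHRASES` list are made up for illustration, and it assumes whitespace-separated input and case-insensitive matching.

```python
def tokenize_with_phrases(text, phrases):
    """Split text on whitespace, merging known multi-word phrases into one token."""
    # Index phrases by their lowercase word tuple (hypothetical phrase list).
    known = {tuple(p.lower().split()): p for p in phrases}
    max_len = max((len(k) for k in known), default=1)

    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest window first, shrinking down to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key in known:
                tokens.append(known[key])
                i += n
                break
        else:
            # No known phrase starts here: emit the word unchanged.
            tokens.append(words[i])
            i += 1
    return tokens


# Illustrative phrase list, not taken from any real dictionary.
PHRASES = ["International Business Machines", "IBM"]
print(tokenize_with_phrases("international business machines logo", PHRASES))
print(tokenize_with_phrases("IBM logo", PHRASES))
```

Longest match first is a design choice here: it stops a shorter dictionary entry from splitting a longer name. It only works for names you already know about, which is where a trained name finder earns its keep.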
On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio <ye.pe...@gmail.com> wrote:

> On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar <geetgang...@gmail.com> wrote:
> > Hi,
> >
> > My requirement is it should have capabilities to match multiple words as
> > one token. for example. When user passes String as International Business
> > machine logo or IBM logo it should return International Business Machine as
> > one token and logo as one token.
>
> This is an interesting problem. I suppose that if the user enters
> "International Business Machines", possibly with some misspelling, you
> want to find all documents containing "IBM" - and that if he enters
> the string "IBM", you want to find documents which contain the string
> "International Business Machines", or even only parts of it. So this
> means you need some kind of map relating some acronyms with their
> content parts. There really are two directions here: acronym to
> content and content to acronym.
>
> One cannot find what an acronym means without some kind of acronym
> dictionary. This means that whatever approach you intend to use, there
> should be an external dictionary involved, which, for each acronym,
> would map a list of possible phrases. Retrieving all phrases matching
> the inputted acronym, you'd inject each part of each phrase as a token
> (removing possible duplicates between phrase parts). That's basically
> it for the direction "acronym to content".
>
> The direction "content to acronym" is trickier, I believe. One way is
> to generate a second (reversed) map, matching each acronym content
> part to a list of acronyms containing that part. You'd simply inject
> acronyms (and possibly other things) if one part of their content is
> matched (or more than one part, if you want to increase relevance).
> This could however possibly require the definition of a specific
> hashing mechanism, if you want to find approximate (distanced) keys
> (e.g. "intenational", with the lacking "r", would still find "IBM"). A
> second way (more coupled to the concept of acronym, so less generic)
> could be to consider that every word starting with a capital letter is
> part of an acronym, buffering sequences of words starting with a
> capital letter, and eventually injecting the resulting acronym, if
> found in the acronym dictionary. This might not be safe, though - the
> user may not have the discipline to capitalize the words being part of
> an acronym (or may even misspell the first letter), or concatenated
> first letters could match an irrelevant acronym (many word sequences
> can give the acronym "IBM").
>
> I do not know whether there already exists some Lucene module which
> processes acronyms, or if someone is working on one. It's definitely
> worth a search though, because writing a good one from scratch could
> mean a few days of work, or more.
>
> HTH.
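
The two directions described above (acronym to content, and content to acronym through a reversed map) can be sketched roughly as follows. This is plain Python rather than a Lucene filter, and everything in it is an assumption made for illustration: the `ACRONYMS` dictionary, the function names, and the use of the standard library's `difflib.get_close_matches` as a stand-in for the "specific hashing mechanism" for approximate keys.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical acronym dictionary: acronym -> list of possible phrases.
ACRONYMS = {
    "IBM": ["International Business Machines"],
    "IBF": ["International Boxing Federation"],
}


def build_reverse_map(acronyms):
    """Map each lowercase phrase part to the acronyms whose phrases contain it."""
    reverse = defaultdict(set)
    for acronym, phrases in acronyms.items():
        for phrase in phrases:
            for part in phrase.lower().split():
                reverse[part].add(acronym)
    return reverse


def expand_acronym(token, acronyms):
    """Acronym to content: de-duplicated parts of every phrase for the acronym."""
    parts = []
    for phrase in acronyms.get(token.upper(), []):
        for part in phrase.lower().split():
            if part not in parts:
                parts.append(part)
    return parts


def acronyms_for_words(words, reverse, min_parts=1, cutoff=0.8):
    """Content to acronym: acronyms matched by at least min_parts input words."""
    hits = defaultdict(int)
    for word in words:
        # Approximate key lookup, so a misspelled word can still reach its part.
        close = get_close_matches(word.lower(), list(reverse), n=1, cutoff=cutoff)
        if close:
            for acronym in reverse[close[0]]:
                hits[acronym] += 1
    return sorted(a for a, count in hits.items() if count >= min_parts)


reverse = build_reverse_map(ACRONYMS)
print(expand_acronym("IBM", ACRONYMS))
print(acronyms_for_words(["intenational", "business"], reverse, min_parts=2))
```

Raising `min_parts` is the "more than one part" relevance knob from the quoted message: with a single matched part, the misspelled "intenational" pulls in every acronym whose phrase contains "international", while requiring two parts narrows the result. Scanning all keys with `difflib` is fine for a small dictionary but would not scale to a large one.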