I don't know whether an analyzer is already available for this, but you
could use GATE or UIMA to run Named Entity Recognition (NER) over the
names and expand the query to include the other names that are used
synonymously. You could do this outside Lucene, or inline using a custom
Lucene tokenizer that embeds either a GATE or UIMA NER.

If you go the custom route (and you are not familiar with GATE or UIMA),
you may want to take a look at Dr Manu Konchady's book on LingPipe,
Lucene and GATE - there is code in there to embed a GATE NER into a
Lucene tokenizer (although it's not a streaming tokenizer, due to the
nature of the NER process). The process would be similar for embedding a
UIMA NER.

GATE (ANNIE) contains data files that list the common synonyms (e.g. Bill
== William, Bob == Robert, Tom == Thomas, etc.), which you can leverage
with GATE's JAPE rule language. Alternatively, you could use the same
data from UIMA using a custom analysis engine (I prefer this route
because it is all Java, with an easier learning curve and better
maintainability).
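
To make the "expand the query outside Lucene" option concrete, here is a
minimal sketch in plain Java. It assumes a small hand-built nickname table
standing in for the GATE/ANNIE data (the class name, method names and the
table contents are all hypothetical, not part of Lucene, GATE or UIMA), and
it only rewrites the query string; you would still hand the result to your
usual query parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class NameQueryExpander {
    // Hypothetical nickname -> full-name table. In practice this would be
    // loaded from a synonym list such as the one shipped with GATE (ANNIE).
    private static final Map<String, List<String>> VARIANTS = Map.of(
        "bill", List.of("william"),
        "will", List.of("william"),
        "bob", List.of("robert"),
        "tom", List.of("thomas"),
        "dan", List.of("daniel"));

    /** Returns the lowercased term followed by any known variants. */
    public static Set<String> expandTerm(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();
        out.add(key);
        out.addAll(VARIANTS.getOrDefault(key, List.of()));
        return out;
    }

    /** Rewrites a whitespace-separated query, OR-ing each term with its variants. */
    public static String expandQuery(String query) {
        List<String> clauses = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // empty input yields no clauses
            }
            Set<String> variants = expandTerm(term);
            clauses.add(variants.size() == 1
                ? variants.iterator().next()
                : "(" + String.join(" OR ", variants) + ")");
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // Prints: (dan OR daniel) smith
        System.out.println(expandQuery("Dan Smith"));
    }
}
```

Note that the table here is one-directional (nickname to full name), which
matches the "Dan" -> "Daniel" case in the question; if a search for
"William" should also find "Bill", you would add the reverse entries as
well.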

-sujit

On Thu, 2011-03-24 at 14:31 -0400, Deepak Konidena wrote:
> Hi,
> 
> I would like to build a search system where a search for "Dan" would also
> search for "Daniel", and a search for "Will" would also search for
> "William". Any ideas on how to go about implementing that? I can think of
> writing a custom Analyzer that would map these partial tokens to their full
> first names or last names. But is there an Analyzer in the Lucene contrib
> modules or elsewhere that does a similar job for me?
> 
> Thanks,
> Deepak Konidena.

