Re: any ideas on this type of analyzer?

Soeren Pekrul Fri, 01 Dec 2006 02:10:34 -0800

Hello Van,

this looks like compound word splitting. The topic was discussed in the thread "Analysis/tokenization of compound words" (http://www.gossamer-threads.com/lists/lucene/java-user/40164?do=post_view_threaded).


The main idea is as follows:

You have a corpus (lexicon/dictionary) and a word you want to split. Build every two-part split of the word, letter by letter, and look both parts up in the corpus. If you find them, you can split the word at that position. You can do that recursively to find all sub-parts. This method finds the smallest possible words. [1]

If your corpus contains no compound words, you can get better quality by searching for the largest possible words: look up the whole word first and split it only if you couldn’t find it. Do this recursively as well. [2]
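A rough sketch of both variants in Python, just to illustrate the idea. The tiny lexicons, the function names and the minimum part length of three letters are my own assumptions for the example, not something taken from [1] or [2]:

```python
def split_smallest(word, lexicon, min_part=3):
    """Variant [1]: try to split first, keep the word whole only as a fallback.

    Returns a list of lexicon words, or None if the word cannot be covered.
    min_part is an assumed lower bound that avoids one-letter "words".
    """
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_smallest(tail, lexicon, min_part)
            if rest is not None:
                return [head] + rest
    return [word] if word in lexicon else None


def split_largest(word, lexicon, min_part=3):
    """Variant [2]: look the whole word up first, split only if it is unknown."""
    if word in lexicon:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        left = split_largest(word[:i], lexicon, min_part)
        if left is None:
            continue
        right = split_largest(word[i:], lexicon, min_part)
        if right is not None:
            return left + right
    return None


# Hypothetical toy lexicons; a real one would be loaded from a word list.
with_compounds = {"bat", "man", "batman"}
without_compounds = {"bat", "man", "screw", "driver"}

print(split_smallest("batman", with_compounds))         # ['bat', 'man']
print(split_largest("batman", with_compounds))          # ['batman']
print(split_largest("screwdriver", without_compounds))  # ['screw', 'driver']
```

The recursion is not memoized, so it is only meant for short words. For indexing you would probably emit both the whole word and its parts as tokens, so that "batman" and "bat man" can match each other.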

Warning: The results are usually not linguistically correct. However, the quality should be good enough for searching.


Sören

[1] C. Monz and M. De Rijke: Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German and Italian, Language & Inference Technology, University of Amsterdam.

[2] Pasqualino Imbemba: A Splitter for German Compound Words, Free University of Bolzano, Bolzano, 2006.


Van Nguyen wrote:

I've been trying to brainstorm on this but could not figure out a way to
go about this.

Let's say I'm searching for "batman".  I want results that include:

batman

bat man

bat-man

etc.

or if I search screwdriver, I would want results to include:

screwdriver

screw drivers

etc.

I've tried using the SnowballAnalyzer.  I've thought about creating a
"SynonymAnalyzer" as described in the Lucene In Action book (but that
would mean I would have to know all the synonyms for each word I need to
index - at this point I do not).  Any suggestions on how to go about
this?

Van

