Re: [Lucene.Net] English Language Concatenated Word Tokenizer

digy digy Sun, 03 Apr 2011 23:48:25 -0700

Instead of splitting the token into meaningful words, you may want to try
the SingleCharTokenAnalyzer in contrib.
It allows %text%-style (substring) searches.
(http://svn.apache.org/viewvc/incubator/lucene.net/trunk/C%23/contrib/Contrib.Net/Contrib.Net/Analysis/Ext/Analysis.Ext.cs)


Given "www.worldbestwebsites.com", you can search for "world", "best", "web",
"website", "bestweb" or even "stwebsi" :)  etc.
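
To show why per-character tokens make such searches possible, here is a
rough Python sketch of the idea. It is not the contrib code and it does not
use Lucene; it assumes (as the analyzer's name suggests) that every character
is indexed as its own token with a position, so a substring query becomes a
phrase query over consecutive characters. The names index_chars and
phrase_match are made up for illustration, and lowercasing is my assumption.

```python
from collections import defaultdict

def index_chars(text):
    """Map each lowercased character to the positions where it occurs."""
    postings = defaultdict(list)
    for pos, ch in enumerate(text.lower()):
        postings[ch].append(pos)
    return postings

def phrase_match(postings, query):
    """Return True if the query's characters occur at consecutive positions."""
    query = query.lower()
    if not query:
        return False
    # Candidate start positions come from the first character's postings.
    for start in postings.get(query[0], []):
        if all((start + i) in postings.get(ch, ()) for i, ch in enumerate(query)):
            return True
    return False

postings = index_chars("www.worldbestwebsites.com")
for q in ["world", "best", "web", "website", "bestweb", "stwebsi", "worst"]:
    print(q, phrase_match(postings, q))
```

Every query from the example above matches, while "worst" does not. The
trade-off is a larger index and more positions to check per query, so this
probably suits short fields such as URLs or file names best.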

DIGY

On Mon, Apr 4, 2011 at 7:15 AM, Thomas Rankin <t...@tomrankin.net> wrote:

> Hey everyone, I'm trying to figure out the best way to get Lucene to detect
> concatenated words in a body of copy or a URL.
>
> I've got a few scenarios I'm trying to handle.  Many times in source code
> and URLs, several words are concatenated together to create a meaningful
> string, e.g. UserRegistrationService.cs and www.worldbestwebsites.com.  I
> would like to index these as User Registration Service cs and www world best
> websites com etc.  I'm not expecting an easy answer, but would like to know
> how the community at large is dealing with these types of scenarios.
>
> Thanks,
>
> Thomas
>
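
For the source-code half of the question quoted above
(UserRegistrationService.cs), splitting on punctuation and case changes may
already be enough. The following is a minimal Python sketch of that
transformation, not a Lucene tokenizer; split_identifier is a hypothetical
helper name, and the regular expression is just one way to define the
boundaries.

```python
import re

def split_identifier(text):
    """Split on non-alphanumerics, then on case changes within each piece."""
    words = []
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        # Alternatives, in order: an acronym run followed by a capitalised
        # word, a capitalised or lowercase word, an uppercase run, digits.
        words.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+", chunk)
        )
    return words

print(split_identifier("UserRegistrationService.cs"))
print(split_identifier("www.worldbestwebsites.com"))
```

Note that the all-lowercase "worldbestwebsites" comes back as a single word:
there is no case change to split on, so that case would need a dictionary
or the substring approach described above.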
