Instead of splitting the token into meaningful words, you may want to try the SingleCharTokenAnalyzer in contrib. It allows %text% (substring-style) searches. ( http://svn.apache.org/viewvc/incubator/lucene.net/trunk/C%23/contrib/Contrib.Net/Contrib.Net/Analysis/Ext/Analysis.Ext.cs )
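To show why indexing one character per token makes %text% searches possible, here is a minimal Python sketch of the general idea. It is not the contrib code: the names single_char_tokens and SingleCharIndex are made up for illustration, and I am assuming that each letter or digit becomes its own lowercased token and that a query is matched like a phrase of consecutive characters. The real analyzer may treat case and punctuation differently.

```python
from collections import defaultdict


def single_char_tokens(text):
    """Yield (position, char) for every letter or digit, lowercased.

    Other characters are dropped but still advance the position,
    so a match cannot silently jump over punctuation.
    """
    for pos, ch in enumerate(text):
        if ch.isalnum():
            yield pos, ch.lower()


class SingleCharIndex:
    """Hypothetical in-memory positional index keyed by single characters."""

    def __init__(self):
        # char -> doc_id -> set of positions where that char occurs
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, text):
        for pos, ch in single_char_tokens(text):
            self.postings[ch][doc_id].add(pos)

    def search(self, query):
        """Return ids of documents containing the query characters in order."""
        tokens = list(single_char_tokens(query))
        if not tokens:
            return set()
        first_offset, first_ch = tokens[0]
        hits = set()
        for doc_id, starts in self.postings.get(first_ch, {}).items():
            for start in starts:
                # Every later query char must sit at the same relative offset.
                if all(
                    (start + off - first_offset)
                    in self.postings.get(ch, {}).get(doc_id, ())
                    for off, ch in tokens[1:]
                ):
                    hits.add(doc_id)
                    break
        return hits


index = SingleCharIndex()
index.add("file", "UserRegistrationService.cs")
print(index.search("registration"))
print(index.search("ionserv"))
```

Both searches find the document, because any run of characters inside the indexed string lines up at consecutive positions. The cost is a much larger index and slower queries than word-level tokens, so this suits short fields such as file names and URLs.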
Given "www.worldbestwebsites.com", you can search for "world", "best", "web", "website", "bestweb" or "stwebsi" :) etc.

DIGY

On Mon, Apr 4, 2011 at 7:15 AM, Thomas Rankin <t...@tomrankin.net> wrote:
> Hey everyone, I'm trying to figure out the best way to get lucene to detect
> concatenated words in a body of copy or a URL.
>
> I've got a few scenarios I'm trying to handle. Many times in source code
> and URLs, several words are concatenated together to create a meaningful
> string, e.g. UserRegistrationService.cs and www.worldbestwebsites.com. I
> would like to index these as User Registration Service cs and www world best
> websites com etc. I'm not expecting an easy answer, but would like to know
> how the community at large is dealing with these types of scenarios.
>
> Thanks,
>
> Thomas