On Tue, Jan 31, 2012 at 01:19:38PM -0500, Desilets, Alain wrote:
> I was wondering if there was a way to tokenize the string into individual
> characters instead, and whether that is advisable from a performance point
> of view.
You can experiment with changing the 'pattern' argument to RegexTokenizer#new to '.' or '\S'. It will definitely be worse from a performance standpoint: matching a URL will now require a PhraseQuery with one term for each character rather than one term for each component matching \w+ in the URL, and those single-character terms will exist in virtually every document.

Marvin Humphrey
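To see why the term count blows up, here is a rough sketch in Python's re module (just an illustration of the two patterns, not Lucy's actual tokenizer; the example URL is made up):

```python
import re

url = "http://example.com/path"

# \w+ pattern: one term per word-character run in the URL.
word_terms = re.findall(r"\w+", url)
print(word_terms)       # ['http', 'example', 'com', 'path'] -> 4 terms

# \S pattern: one term per non-whitespace character.
char_terms = re.findall(r"\S", url)
print(len(char_terms))  # 23 terms, one per character
```

A PhraseQuery for that URL would then need all 23 single-character terms to match in sequence, and terms like 'e' or 't' occur in nearly every document, so every posting list involved is enormous.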
