On 2/2/12 7:40 AM, Desilets, Alain wrote:
Thx Peter. In my case, the fields on which I need to do wildcard searches are 
fields that specify the URL of a document. I want to use them to 
limit the search to documents that are on specific web sites.

It seems the best balance between accuracy and speed in that case would be to tokenize 
on non-word characters. Then I could retrieve a superset of docs on, say, 
www.somewhere.org by searching for "www.somewhere.org" (with a QueryParser). 
This might accidentally retrieve docs whose URLs contain www/somewhere/org (for 
example), but I would do a second pass to filter out the docs whose URLs do not match the 
actual expression www.somewhere.org. I would need this second pass anyway, even if 
I were using a wildcard search, because I might accidentally match a URL that has 
www.somewhere.org in a part other than the host name (ex: 
http://www.aplace.com/www.somewhere.org.html).
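
A rough sketch of that two-pass idea, in plain Python rather than Lucene itself. The function names and the sample URLs are made up for illustration, and the phrase match only approximates what a phrase query over a field tokenized on non-word characters would return:

```python
import re
from urllib.parse import urlsplit

def tokenize(url):
    # Split on runs of non-word characters, dropping empty tokens.
    return [t for t in re.split(r"\W+", url.lower()) if t]

def phrase_match(tokens, phrase):
    # True if the phrase tokens occur consecutively in tokens.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def candidates(urls, site):
    # First pass: cheap superset, may contain false positives.
    phrase = tokenize(site)
    return [u for u in urls if phrase_match(tokenize(u), phrase)]

def filter_by_host(urls, site):
    # Second pass: keep only URLs whose host name is exactly the site.
    return [u for u in urls if urlsplit(u).hostname == site.lower()]

# Hypothetical sample data.
urls = [
    "http://www.somewhere.org/docs/a.html",
    "http://example.com/www/somewhere/org/b.html",
    "http://www.aplace.com/www.somewhere.org.html",
]

superset = candidates(urls, "www.somewhere.org")
exact = filter_by_host(superset, "www.somewhere.org")
```

Here the first pass matches all three sample URLs, and the second pass keeps only the first one.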


Why not pull the hostname out at indexing time into its own field? Then your particular use case should get no false positives.
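
Something along these lines, sketched in plain Python rather than Lucene code. The dict-based "documents" and the field names url, host and body are only assumptions for illustration; in a real index the host would presumably be stored as an untokenized field and matched with an exact term query:

```python
from urllib.parse import urlsplit

def make_doc(url, body):
    # Extract the host name once, at indexing time, into its own field.
    host = urlsplit(url).hostname or ""
    return {"url": url, "host": host, "body": body}

def search_by_host(docs, host):
    # Exact match on the dedicated host field, so no second pass is needed.
    return [d for d in docs if d["host"] == host.lower()]

# Hypothetical sample data.
docs = [
    make_doc("http://www.somewhere.org/docs/a.html", "first"),
    make_doc("http://www.aplace.com/www.somewhere.org.html", "second"),
]

hits = search_by_host(docs, "www.somewhere.org")
```

With the sample data only the first document is returned, since the second one has www.aplace.com as its host.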


--
Peter Karman  .  http://peknet.com/  .  [email protected]
