On 2/2/12 7:40 AM, Desilets, Alain wrote:
Thx Peter. In my case, the fields on which I need to do wild-card searches are
fields that specify the URL of a document. I want to be able to use this to
limit the search to documents which are on specific web sites.
It seems the best balance between accuracy and speed in that case would be to tokenize
on non-word characters. Then I could retrieve a superset of docs on, say,
www.somewhere.org by searching for "www.somewhere.org" (with a QueryParser).
This might accidentally retrieve docs whose URLs contain www/somewhere/org (for
example), but I would do a second pass to filter out the docs whose URLs do not match the
actual expression www.somewhere.org. I would need to do this second pass anyway, even if
I was using a WildCard search, because I might accidentally match a URL that has
www.somewhere.org in a different part than the host name (ex:
http://www.aplace.com/www.somewhere.org.html).
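
A minimal sketch of that two-pass idea, written in plain Python rather than against any search library. The function names (tokenize, candidate_urls, filter_by_host) are hypothetical, and the first pass only imitates what a phrase query over a non-word-character tokenizer would return; it is not how QueryParser itself is implemented.

```python
import re
from urllib.parse import urlparse


def tokenize(text):
    """Split on runs of non-word characters, lowercased (assumed analyzer behaviour)."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def contains_phrase(tokens, phrase):
    """True if the phrase tokens occur consecutively somewhere in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def candidate_urls(urls, site):
    """First pass: cheap superset, stands in for the phrase search on the index."""
    phrase = tokenize(site)
    return [u for u in urls if contains_phrase(tokenize(u), phrase)]


def filter_by_host(urls, site):
    """Second pass: keep only URLs whose host name is exactly the requested site."""
    return [u for u in urls if urlparse(u).hostname == site.lower()]


urls = [
    "http://www.somewhere.org/a.html",
    "http://example.com/www/somewhere/org",
    "http://www.aplace.com/www.somewhere.org.html",
]
superset = candidate_urls(urls, "www.somewhere.org")
exact = filter_by_host(superset, "www.somewhere.org")
```

Here the first pass keeps all three URLs, because each one contains the tokens www, somewhere, org in sequence, and the second pass discards the two where that sequence is not the host name.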
Why not pull the hostname out at indexing time into its own field? Then
your particular use case should get no false positives.
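
Something like the following, as a rough sketch in Python. The field names ("url", "host", "body") and the helpers host_field and build_doc are made up for illustration, and the dict just stands in for whatever document object your indexer takes.

```python
from urllib.parse import urlparse


def host_field(url):
    """Return the lowercased host name of url, or "" if none can be parsed."""
    return urlparse(url).hostname or ""


def build_doc(url, body):
    """Assemble the fields to index; "host" is meant to be stored untokenized."""
    return {"url": url, "host": host_field(url), "body": body}


doc = build_doc("http://www.aplace.com/www.somewhere.org.html", "page text")
# An exact-match query on host:"www.somewhere.org" would not hit this doc,
# because its host field is "www.aplace.com".
```

Indexing the host field as a single untokenized term means the site restriction becomes an exact term match, so neither the wildcard search nor the second filtering pass is needed for this case.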
--
Peter Karman . http://peknet.com/ . [email protected]