Even if I did that, I would still need to search the domain as a non-exact value. For example, I might want to search on *.gc.ca to search all Government of Canada web sites, or search on *.nrc-cnrc.gc.ca to search only on the site of the National Research Council of Canada, or limit the search to any web site in Canada, like *.ca.
Alain

-----Original Message-----
From: Peter Karman [mailto:[email protected]]
Sent: Thursday, February 02, 2012 9:27 AM
To: '[email protected]'
Subject: Re: [lucy-user] Can lucy do substring search?

On 2/2/12 7:40 AM, Desilets, Alain wrote:
> Thx Peter. In my case, the fields on which I need to do wild-card searches
> are fields that specify the URL of a document. I want to be able to use this
> to limit the search to documents which are on specific web sites.
>
> It seems the best balance in that case, between accuracy and speed, would be
> to tokenize on non word character. Then, I could retrieve a superset of docs
> on say, www.somewhere.org, by searching for "www.somewhere.org" (with a
> QueryParser). This might accidentally retrieve docs whose urls contain
> www/somewhere/org (for example), but I would do a second pass to filter the
> docs whose url do not match the actual expression www.somewhere.org. I would
> need to do this second pass anyway, even if I was using a WildCard search,
> because, I might accidentally match a URL that has www.somewhere.org in a
> different part than the IP name (ex:
> http://www.aplace.com/www.somewhere.org.html).
>

why not pull the hostname out at indexing time into its own field? then your
particular use case should get no false positives?

--
Peter Karman . http://peknet.com/ . [email protected]
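
One possible way to combine the separate hostname field with non-exact searches such as *.gc.ca is to store every dot-aligned suffix of the hostname as its own term, so that a suffix search becomes an exact term match. The Python sketch below only illustrates that idea; it does not use the Lucy API, and the helper names (hostname_of, domain_suffixes, matches_site) are hypothetical. Note that under this scheme a pattern like *.gc.ca also matches the bare host gc.ca.

```python
from urllib.parse import urlsplit


def hostname_of(url):
    """Return the lower-cased host part of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def domain_suffixes(host):
    """List every dot-aligned suffix of a host name, shortest first.

    These are the terms one could store in a dedicated hostname field
    at indexing time (an assumption, not something Lucy does for you).
    """
    labels = host.split(".") if host else []
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


def matches_site(url, pattern):
    """True if the URL's host falls under a pattern such as '*.gc.ca'."""
    suffix = pattern[2:] if pattern.startswith("*.") else pattern
    return suffix.lower() in domain_suffixes(hostname_of(url))
```

For a document at http://www.nrc-cnrc.gc.ca/eng/index.html, the stored terms would be ca, gc.ca, nrc-cnrc.gc.ca and www.nrc-cnrc.gc.ca. A search for *.gc.ca then reduces to an exact lookup of gc.ca in that field, and a URL such as http://www.aplace.com/www.somewhere.org.html is never matched by www.somewhere.org, because only the host part is indexed.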
