Even if I did that, I would still need to match the domain inexactly. For 
example, I might want to search on *.gc.ca to cover all Government of Canada 
web sites, on *.nrc-cnrc.gc.ca to cover only the site of the National Research 
Council of Canada, or on *.ca to cover any web site in Canada.
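
One possible way around the wildcard, sketched here in Python rather than 
with Lucy's own API: at indexing time, expand the hostname into every one of 
its domain suffixes and store each as an exact-match term. A search on 
*.gc.ca then becomes an exact lookup of "gc.ca". The function names 
domain_suffixes and host_matches are hypothetical, and how the suffixes would 
be stored in a Lucy field is left open.

```python
def domain_suffixes(host):
    """Return every dot-separated suffix of host, longest first.

    Hypothetical helper: each suffix could be indexed as an exact-match term.
    """
    host = host.lower().strip(".")
    if not host:
        return []
    labels = host.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def host_matches(host, pattern):
    """Check host against a pattern such as "*.gc.ca".

    Assumption: "*.gc.ca" is treated as "gc.ca or any host under it",
    so the bare domain "gc.ca" matches too.
    """
    suffix = pattern[2:] if pattern.startswith("*.") else pattern
    return suffix.lower() in domain_suffixes(host)


print(domain_suffixes("www.nrc-cnrc.gc.ca"))
print(host_matches("www.nrc-cnrc.gc.ca", "*.gc.ca"))
print(host_matches("www.gc.ca.example.com", "*.gc.ca"))
```

Because the comparison is on whole labels, a host such as 
www.gc.ca.example.com does not match *.gc.ca, which a plain substring search 
would get wrong.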

Alain

-----Original Message-----
From: Peter Karman [mailto:[email protected]] 
Sent: Thursday, February 02, 2012 9:27 AM
To: '[email protected]'
Subject: Re: [lucy-user] Can lucy do substring search?

On 2/2/12 7:40 AM, Desilets, Alain wrote:
> Thx Peter. In my case, the fields on which I need to do wild-card searches 
> are fields that specify the URL of a document. I want to be able to use this 
> to limit the search to documents which are on specific web sites.
>
> It seems the best balance in that case, between accuracy and speed, would be 
> to tokenize on non-word characters. Then, I could retrieve a superset of docs 
> on say, www.somewhere.org, by searching for "www.somewhere.org" (with a 
> QueryParser). This might accidentally retrieve docs whose URLs contain 
> www/somewhere/org (for example), but I would do a second pass to filter out 
> the docs whose URLs do not match the actual expression www.somewhere.org. I 
> would need to do this second pass anyway, even if I was using a WildCard 
> search, because I might accidentally match a URL that has www.somewhere.org 
> in a different part than the host name (ex: 
> http://www.aplace.com/www.somewhere.org.html).
>

Why not pull the hostname out at indexing time into its own field? Then 
your particular use case should get no false positives.
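
A minimal sketch of that extraction step, written in Python for illustration 
only (it is not Lucy code, and extract_host is a hypothetical name). It 
assumes the URLs are well formed enough for the standard library parser to 
find the host part; the result would go into a separate exact-match field.

```python
from urllib.parse import urlsplit


def extract_host(url):
    """Return the lowercased hostname of url, or "" if none can be parsed."""
    return urlsplit(url).hostname or ""


# The host is taken from the authority part only, so a hostname-like
# string in the path does not leak into the field.
print(extract_host("http://www.somewhere.org/index.html"))
print(extract_host("http://www.aplace.com/www.somewhere.org.html"))
```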


-- 
Peter Karman  .  http://peknet.com/  .  [email protected]
