How would one go about this: "During indexing, your indexing filter could add a field named "sitecluster""? Could I create a field called "region" and apply it to sites based on their location? If so, how? Can this be tweaked in a config file?

-Bud


On Nov 30, 2005, at 12:29 PM, Andy Lee wrote:

On Nov 30, 2005, at 1:20 AM, Matt Kangas wrote:
- If you only want to match one site at a time, you can just add "site:xxx" to the query. The "site" field exists in the index by default.

Note that the index-basic indexing filter does not tokenize the "site" field, so if you do "site:salami.com" you will only match URLs whose host component exactly matches the value you give -- http://salami.com/etc and ftp://salami.com/etc but NOT http://www.salami.com. This may or may not be what you want.

- If you want to assign ids to clusters of sites, you can do the site->id lookup at index time and add a custom field to the index.

This is one way to address the above issue. During indexing, your indexing filter could add a field named "sitecluster" (or whatever), and for all the above URLs (and anything else you want to cluster with them) you would set "salami.com" as the value of that field. Then your search would be "sitecluster:salami.com".
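The lookup step described above can be sketched in plain Java, independent of any Nutch classes. Everything named here is an assumption made for illustration: the class SiteClusterLookup, the hard-coded host table, and the choice to fall back to the host itself when no mapping exists. How the returned value gets added to the index document depends on the indexing filter interface of the Nutch version in use, so that wiring is deliberately left out. The same shape would work for a field such as "region", with region names as the map values.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class SiteClusterLookup {
    // Hypothetical host -> cluster table; a real setup might load this from a file.
    private static final Map<String, String> CLUSTERS = new HashMap<>();
    static {
        CLUSTERS.put("salami.com", "salami.com");
        CLUSTERS.put("www.salami.com", "salami.com");
    }

    /** Returns the cluster value for a URL, or the host itself if no mapping is configured. */
    static String clusterFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return ""; // URL without a host component
        }
        host = host.toLowerCase(Locale.ROOT);
        String cluster = CLUSTERS.get(host);
        return cluster != null ? cluster : host;
    }

    public static void main(String[] args) {
        // The value an indexing filter would store in the "sitecluster" field.
        System.out.println(clusterFor("http://www.salami.com/etc"));
        System.out.println(clusterFor("ftp://salami.com/etc"));
    }
}
```

Because the stored value is a single exact string, a query like "sitecluster:salami.com" would match every URL whose host maps to that value, provided the field is indexed untokenized like "site".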

Another approach would be to search not on the "site" field but the "url" field, which *is* tokenized at indexing time. So "url:salami" would find all the salami URLs above, as well as http://www2.salami.com and http://www.salami.org and http://salami.lunch.com -- which again may or may not be what you want.
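To make the difference between the two fields concrete, here is a rough Java sketch that imitates both kinds of matching. It is only an approximation under stated assumptions: the class SiteVsUrlMatch and its methods are hypothetical, and splitting on non-alphanumeric characters stands in for whatever tokenization Nutch actually applies to the "url" field.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;

class SiteVsUrlMatch {
    /** Untokenized "site"-style match: the whole host must equal the query value. */
    static boolean matchesSite(String url, String site) {
        String host = URI.create(url).getHost();
        return host != null && host.equalsIgnoreCase(site);
    }

    /** Tokenized "url"-style match: the term must equal one token of the URL. */
    static boolean matchesUrlTerm(String url, String term) {
        // Assumed tokenizer: lowercase, then split on anything that is not a letter or digit.
        String[] tokens = url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        return Arrays.asList(tokens).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://salami.com/etc",
            "http://www.salami.com",
            "http://www2.salami.com",
            "http://salami.lunch.com"
        };
        for (String u : urls) {
            System.out.println(u + "  site:salami.com=" + matchesSite(u, "salami.com")
                    + "  url:salami=" + matchesUrlTerm(u, "salami"));
        }
    }
}
```

Under these assumptions only the first URL satisfies the exact host match, while all four contain the token "salami".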

--Andy




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
