--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

---------- Forwarded message ----------
Date: Sun, 29 Dec 2002 17:25:22 -0800 (PST)
From: Andrew Daviel <[EMAIL PROTECTED]>
To: Geoff Hutchison <[EMAIL PROTECTED]>
Subject: geographic searching and ht//Dig versions

Hi,

A long time ago (well, about 10 months, I think) I was working on a geographic-enabled version of ht://Dig which picks up location from metadata and allows a "sort by closest" search. I think at the time you (or someone) expressed interest in this. It is online at http://geotags.com/ with a few (hundred) pages indexed (try "disney" or "restaurant").

I was trying to get back to this and to index a moderately large site that someone has added metadata to, and was having some trouble with htdig hanging (the log output with -vv stops, but htdig continues to eat CPU). I am currently using a modified version of 3.2.0b3. I am also using htdig (the same version, as I recall) for regular search within a domain.

I was wondering whether I would be better off using the production version 3.1.6, and whether I might submit the patches to the general effort.

For the domain search (at www.triumf.ca) I wanted to allow a search on author name. We have quite a few scientific preprints in PostScript and PDF; the metadata in PDF (and in many HTML editors) typically includes author name, keywords, subject and title. (To tell the truth, we have a lot of pages without even a title, or with a title of "frrt56.tex", but that's an education problem...) Adding subject, title and keywords to the general index is reasonable, but there is a distinct difference between someone having written a paper and just being mentioned in it, so I added an author entry.

For the geographic search I collect three metadata values: position, region and placename. To allow htdig to collect all of these, I added four entries to the DOC structure. To control indexing I added a config boolean "require_geo".
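
As a rough illustration of the "sort by closest" idea and the four extra per-document entries mentioned above (presumably author plus the three geographic values), here is a minimal Python sketch. It is not taken from the ht://Dig patches: the `DocRecord` class, its field names, the (latitude, longitude) representation of position and the use of the haversine great-circle formula are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass
class DocRecord:
    """Hypothetical stand-in for a document record; not ht://Dig's real structure."""
    url: str
    author: Optional[str] = None
    position: Optional[Tuple[float, float]] = None  # assumed (lat, lon) in degrees
    region: Optional[str] = None
    placename: Optional[str] = None


def great_circle_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Haversine distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def sort_by_closest(docs: List[DocRecord], origin: Tuple[float, float]) -> List[DocRecord]:
    """Return documents that carry a position, nearest to `origin` first."""
    located = [d for d in docs if d.position is not None]
    return sorted(located, key=lambda d: great_circle_km(origin, d.position))
```

In this sketch, documents without a position are simply left out of the distance-sorted results; whether the real patch drops them or lists them last is not stated in the message.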
I was intending to add another config value, "require_region", to restrict indexing to one geographic region. Currently, if require_geo is true, region metadata acts like robots=follow,noindex while position metadata acts like robots=all. This is somewhat messy but works more or less.

Since the overwhelming majority of web pages do not have position data, I do not want to visit them all to check, but I do want to allow lists of links to position-enabled pages and to follow links that might lead to them. I think I currently follow all links up to max_hop_count and then follow links forever if a region is present; I might change this, as it is probably better to require a region on index pages and honour max_hop_count.

If require_geo is false or not present, htdig runs normally. It will index position and region data if it finds them and can include them in search results, but normally it won't find any. So I think it is backward-compatible with regular htdig.

regards
Andrew Daviel
Vancouver Webpages

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
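
The require_geo behaviour described in the message can be sketched roughly as the decision function below. This is only one reading of the description, not the actual patch: `CrawlDecision`, `decide` and their parameters are hypothetical names, and the hop-count comparison (`hop_count < max_hop_count`) is an assumption about how the limit is applied.

```python
from typing import NamedTuple


class CrawlDecision(NamedTuple):
    index: bool   # add this page to the search index
    follow: bool  # follow the links found on this page


def decide(require_geo: bool, has_position: bool, has_region: bool,
           hop_count: int, max_hop_count: int) -> CrawlDecision:
    """Hypothetical per-page decision modelled on the described behaviour."""
    within_hops = hop_count < max_hop_count  # assumed meaning of the hop limit

    if not require_geo:
        # Normal htdig behaviour: index everything, honour the hop limit.
        return CrawlDecision(index=True, follow=within_hops)

    # Position metadata acts like robots=all: the page is indexed.
    index = has_position
    # Region metadata acts like robots=follow,noindex, and (as described)
    # links are followed past the hop limit while a region is present.
    follow = within_hops or has_region
    return CrawlDecision(index=index, follow=follow)
```

The alternative floated in the message (require a region on index pages and honour max_hop_count) would, under the same assumptions, amount to replacing the `follow` line with `within_hops and has_region`.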
