--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

---------- Forwarded message ----------
Date: Sun, 29 Dec 2002 17:25:22 -0800 (PST)
From: Andrew Daviel <[EMAIL PROTECTED]>
To: Geoff Hutchison <[EMAIL PROTECTED]>
Subject: geographic searching and ht//Dig versions


Hi

A long time ago (well, about 10 months I think) I was working on a
geographic-enabled version of ht//Dig which picks up location from
metadata and allows a "sort by closest" search.
I think at the time you (or someone) expressed interest in this.
It is online at http://geotags.com/ with a few (hundred) pages indexed
(try "disney" or "restaurant")
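
For illustration only, here is a minimal sketch of how a "sort by closest" ranking could work. It is not the code in the patched ht//Dig. It assumes each search hit carries an optional (latitude, longitude) pair in degrees and ranks hits by great-circle (haversine) distance from a chosen origin. The names haversine_km, sort_by_closest and the "position" key are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_closest(hits, origin):
    """Return hits ordered by distance from origin (lat, lon).

    Hits without a "position" entry are placed last (an assumption;
    the message does not say how such pages are ranked).
    """
    def key(hit):
        pos = hit.get("position")
        if pos is None:
            return (1, 0.0)
        return (0, haversine_km(origin[0], origin[1], pos[0], pos[1]))

    return sorted(hits, key=key)
```

Calling sort_by_closest(hits, (49.28, -123.12)) would list pages tagged near Vancouver first and untagged pages last.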

I have been trying to pick this up again by indexing a moderately large
site that someone has added metadata to, but I am having some trouble
with htdig hanging (the log output with -vv stops, but htdig continues
to eat CPU). I am currently using a modified version of 3.2.0b3.

I am also using htdig (the same version, as I recall) for regular search
within a domain.

I was wondering whether it would be better to use the production version,
3.1.6, and whether I might submit the patches to the general effort.

For the domain search (at www.triumf.ca) I wanted to allow a search on
author name. We have quite a few scientific preprints in PostScript
and PDF; the metadata in PDF (and in many HTML editors)
typically includes author name, keywords, subject and title.
(To tell the truth, we have a lot of pages without even a title, or with
a title of "frrt56.tex", but that's an education problem...)
Adding subject, title and keywords to the general index is
reasonable, but there is a distinct difference between someone having
written a paper and just being mentioned in it, so I added an
author entry.

For the geographic search I collect 3 metadata values: position,
region and placename. So to allow htdig to collect all of these (the
three geographic values plus author) I added 4 entries to the DOC
structure.
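
As a rough illustration of collecting these four values from a page, here is a minimal sketch using Python's standard html.parser. The meta tag names geo.position, geo.region and geo.placename, and the "lat;lon" form of the position value, are assumptions; the message only names the values as position, region and placename. MetaCollector, collect_metadata and parse_position are hypothetical names, and this is not the parsing code in ht//Dig.

```python
from html.parser import HTMLParser

# Assumed <meta name="..."> spellings mapped to the four entries
# mentioned in the message.
WANTED = {
    "geo.position": "position",
    "geo.region": "region",
    "geo.placename": "placename",
    "author": "author",
}


class MetaCollector(HTMLParser):
    """Collect the first value seen for each wanted <meta> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").strip().lower()
        content = attr.get("content")
        if name in WANTED and content:
            self.fields.setdefault(WANTED[name], content.strip())


def collect_metadata(html):
    """Return a dict with any of position, region, placename, author."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


def parse_position(value):
    """Parse an assumed 'lat;lon' string; return None if malformed."""
    parts = value.replace(",", ";").split(";")
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon
```

Pages without any of these tags simply yield an empty dict, so nothing changes for ordinary documents.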

To control indexing I added a config boolean "require_geo".
I was intending to add another config value, "require_region", to restrict
indexing to one geographic region. Currently, if require_geo is true,
region metadata acts like robots=follow,noindex while position
metadata acts like robots=all. This is somewhat messy but works
more or less. Since the overwhelming majority of web pages do not have
position data, I do not want to visit them all to check, but I do want to
allow lists of links to position-enabled pages and to follow links that
might lead to them. I think I currently follow all links up to
max_hop_count and then keep following links indefinitely if a region
is present. I might change this; it is probably better to require a region
on index pages and honour max_hop_count.


If require_geo is false or not present, htdig runs normally. It will index
position and region data if it finds them and can include them in search
results, but normally it won't find any. So I think it is
backward-compatible with regular htdig.
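
The require_geo behaviour described above can be sketched as a small decision function. This is only one reading of the message, not the patched htdig code: Action, crawl_action and the hops argument are hypothetical, the comparison hops < max_hop_count is a simplification of how the hop limit applies, and the handling of a page that has a position but no region beyond the hop limit is an assumption, since the message does not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    index: bool   # add the page to the search index
    follow: bool  # follow the links found on the page


def crawl_action(require_geo, has_position, has_region, hops, max_hop_count):
    """Decide what to do with a fetched page under one reading of require_geo."""
    within_limit = hops < max_hop_count

    if not require_geo:
        # Regular behaviour: index everything, honour the hop limit.
        return Action(index=True, follow=within_limit)

    # Position metadata acts like robots=all, so only such pages are indexed.
    index = has_position
    # Region metadata acts like robots=follow,noindex and, as described,
    # keeps links being followed even past max_hop_count.
    # Assumption: a page with a position but no region is not followed
    # once the hop limit is reached.
    follow = within_limit or has_region
    return Action(index=index, follow=follow)
```

Under the change suggested in the message (require a region on index pages and honour max_hop_count), the follow rule would presumably become within_limit and has_region instead.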


regards

Andrew Daviel
Vancouver Webpages




