On Fri, 4 May 2001, Don Gourley wrote:
+ I've recently built and installed htdig-3.2.0b3 and it is
+ working pretty well. However, it is indexing more docs
+ than I would like. I can't get into ftp.htdig.org right
+ now to search the contrib stuff (server not responding),
+ and I wonder if there is any script there to list in text
+ form the documents that have been indexed from, say, the
+ db.docs.index file?
I have recently extended my script that builds a report from htdig's
stdout stream. Part of it needs the -s output (the list of unindexed
pages); the rest uses the regular output. It has only been tested on
3.1.5, with a single -v on htdig.
http://wwwsearch.ox.ac.uk/dig_report.pl
You can view the output at http://wwwsearch.ox.ac.uk/report/
+ Also, is my assumption correct that if a document is
+ excluded via exclude_urls then no links in it are followed
+ for indexing, even if those links wouldn't by themselves
+ be excluded?
Quite right. Excluding a document implies it is never fetched, which in
turn implies the equivalent of <noindex,nofollow>.
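
To illustrate why the links are lost, here is a minimal sketch of a
crawler loop in Python. It is not htdig's code: the crawl() function,
the in-memory LINKS graph and the example.org URLs are all made up for
the illustration, and the exclusion test assumes simple substring
patterns. An excluded page is skipped before it is fetched, so the
crawler never sees the links inside it.

```python
from collections import deque


def crawl(start, links, excluded_substrings):
    """Breadth-first crawl over an in-memory link graph (hypothetical helper).

    links maps each URL to the URLs it links to; looking a URL up in it
    stands in for fetching and parsing the page.
    """
    def is_excluded(url):
        return any(pattern in url for pattern in excluded_substrings)

    indexed = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if is_excluded(url):
            # Never fetched, so the links it contains are never discovered.
            continue
        indexed.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


# Made-up site: b.html is only reachable through the excluded page.
LINKS = {
    "http://example.org/": [
        "http://example.org/private/menu.html",
        "http://example.org/a.html",
    ],
    "http://example.org/private/menu.html": ["http://example.org/b.html"],
    "http://example.org/a.html": [],
    "http://example.org/b.html": [],
}

result = crawl("http://example.org/", LINKS, ["/private/"])
print(result)
```

In this toy graph b.html would not itself be excluded, but it is never
indexed because the only page linking to it is. If another, non-excluded
page linked to it, it would be picked up as usual.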
+ Finally, what would be the best way to avoid having
+ "equivalent" documents indexed multiple times when they
+ are referenced by slightly different URLs, such as:
+
+ http://websource.wrlc.org:8000/voyager/stgfac/
+ http://websource.wrlc.org:8000/voyager/stgfac/index.html
+ http://websource.wrlc.org:8000/voyager/stgfac/?N=D
+ http://websource.wrlc.org:8000/voyager/stgfac/?D=A
That's not one I've found a complete solution to. I tend to have a
smallish number of substantial branches that might be known by multiple
names, usually of the form www.foo.ox.ac.uk being equivalent to
www.bar.ox.ac.uk/foo/, which I deal with by excluding one of the
branches if I happen to spot the duplication. I know I ought to do it
with url_part_aliases 8-)
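
For spotting such duplicates in a list of indexed URLs, a rough sketch
in Python follows. It is outside htdig and is not part of my report
script. The canonical_url() helper is hypothetical, "index.html" as the
default document is an assumption, and so is my reading of ?N=D and
?D=A as directory-listing column-sort links (with single-letter keys
N, M, S, D and values A or D).

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit

DEFAULT_DOCS = ("index.html",)     # assumption: the server's default document
SORT_KEYS = {"N", "M", "S", "D"}   # assumption: directory-listing sort columns
SORT_VALUES = {"A", "D"}           # ascending / descending


def canonical_url(url):
    """Map equivalent URLs to one canonical form (hypothetical helper)."""
    scheme, netloc, path, query, _fragment = urlsplit(url)

    # Treat /dir/index.html as the same document as /dir/
    for doc in DEFAULT_DOCS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]
            break

    # Drop the query only if it consists purely of sort parameters.
    pairs = parse_qsl(query, keep_blank_values=True)
    if pairs and all(k in SORT_KEYS and v in SORT_VALUES for k, v in pairs):
        query = ""

    return urlunsplit((scheme, netloc.lower(), path, query, ""))


urls = [
    "http://websource.wrlc.org:8000/voyager/stgfac/",
    "http://websource.wrlc.org:8000/voyager/stgfac/index.html",
    "http://websource.wrlc.org:8000/voyager/stgfac/?N=D",
    "http://websource.wrlc.org:8000/voyager/stgfac/?D=A",
]
canonical = {canonical_url(u) for u in urls}
print(canonical)
```

All four of the example URLs collapse to the first one. Grouping a URL
list by canonical_url() shows which patterns would be worth excluding in
the htdig configuration.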
regards,
Malcolm.
[EMAIL PROTECTED] http://users.ox.ac.uk/~malcolm/
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html