Re: [htdig] what docs are indexed?

Malcolm Austen Fri, 04 May 2001 08:46:48 -0700
On Fri, 4 May 2001, Don Gourley wrote:

+ I've recently built and installed htdig-3.2.0b3 and it is
+ working pretty well.  However, it is indexing more docs
+ than I would like.  I can't get into ftp.htdig.org right
+ now to search the contrib stuff (server not responding),
+ and I wonder if there is any script there to list in text
+ form the documents that have been indexed from, say, the
+ db.docs.index file?

I have recently extended my script that builds a report from htdig's
stdout stream. Some of it needs the -s output (the list of unindexed
pages); some uses the regular output. It has only been tested on 3.1.5
with one -v on htdig.

        http://wwwsearch.ox.ac.uk/dig_report.pl

You can view the output at http://wwwsearch.ox.ac.uk/report/
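
If you only want a plain list of URLs out of a saved log, something like
the following might do as a starting point. This is a minimal sketch and
not how dig_report.pl works: it assumes only that each document's URL
appears somewhere on a line of the captured output, and the sample lines
are hypothetical, not real htdig output.

```python
import re

# Matches an http(s) URL up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")


def extract_urls(lines):
    """Return the unique URLs found in the log lines, in first-seen order."""
    seen = {}
    for line in lines:
        for match in URL_RE.findall(line):
            # Trim punctuation that may directly follow a URL in a log line.
            url = match.rstrip(":,")
            seen.setdefault(url, None)
    return list(seen)


# Hypothetical log lines for illustration only; not real htdig output.
sample = [
    "fetching http://www.example.org/ ok",
    "fetching http://www.example.org/a.html: size 1200",
    "fetching http://www.example.org/ ok",
]
urls = extract_urls(sample)
for url in urls:
    print(url)
```

To run it over a real log, pass an open file object to extract_urls()
instead of the sample list.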

+ Also, is my assumption correct that if a document is
+ excluded via exclude_urls then no links in it are followed
+ for indexing, even if those links wouldn't by themselves
+ be excluded?

Quite right -
 excluding implies no fetching which implies <noindex,nofollow>
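
A toy illustration of why that follows, as a sketch only: the link graph
below is hypothetical, and matching exclusions by substring is an
assumption of the sketch rather than a statement about htdig internals.
A page that is never fetched is never parsed, so links that appear only
on it are never discovered.

```python
from collections import deque


def crawl(start, links, excluded):
    """Breadth-first walk of `links`, never fetching an excluded URL."""
    fetched = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if any(pattern in url for pattern in excluded):
            continue  # not fetched, so its outgoing links are never seen
        fetched.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return fetched


# Hypothetical site: only /private/ links to /deep.html.
links = {
    "/": ["/a.html", "/private/"],
    "/private/": ["/deep.html"],
}
result = crawl("/", links, excluded=["/private/"])
print(result)
```

Here /deep.html would not itself match the exclusion, but it is never
reached because the only page linking to it is skipped.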

+ Finally, what would be the best way to avoid having
+ "equivalent" documents indexed multiple times when they
+ are referenced by slightly different URLs, such as:
+
+ http://websource.wrlc.org:8000/voyager/stgfac/
+ http://websource.wrlc.org:8000/voyager/stgfac/index.html
+ http://websource.wrlc.org:8000/voyager/stgfac/?N=D
+ http://websource.wrlc.org:8000/voyager/stgfac/?D=A

That's not one I've found a complete solution to. I tend to have a
smallish number of substantial branches that might be known by multiple
names - usually of the form of www.foo.ox.ac.uk being equivalent to
www.bar.ox.ac.uk/foo/ which I deal with by excluding one of the branches
if I happen to spot the duplication. I know I ought to do it with
url_part_aliases 8-)
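
For spotting such duplicates in a list of indexed URLs after the fact,
a rough normalisation pass might help. This is a sketch outside htdig
itself, not a configuration fix: it assumes that index.html is the
default document and that one-letter queries such as ?N=D are
directory-listing sort links that do not change the content. Both
assumptions need checking against the actual server.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default document names and sort-only query strings.
DEFAULT_DOCS = ("index.html",)
SORT_QUERIES = {"N=A", "N=D", "M=A", "M=D", "S=A", "S=D", "D=A", "D=D"}


def canonical(url):
    """Map equivalent spellings of a directory URL to a single form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    for name in DEFAULT_DOCS:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # keep the trailing slash
    if query in SORT_QUERIES:
        query = ""
    return urlunsplit((scheme, netloc, path, query, ""))


examples = [
    "http://websource.wrlc.org:8000/voyager/stgfac/",
    "http://websource.wrlc.org:8000/voyager/stgfac/index.html",
    "http://websource.wrlc.org:8000/voyager/stgfac/?N=D",
    "http://websource.wrlc.org:8000/voyager/stgfac/?D=A",
]
unique = {canonical(u) for u in examples}
print(unique)
```

All four example URLs collapse to the first form, so grouping a URL list
by canonical() shows which documents were indexed more than once.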

regards,
        Malcolm.

 [EMAIL PROTECTED]     http://users.ox.ac.uk/~malcolm/

