According to Stefan Seiz:
> How would I configure htdig to spider .html pages but to ONLY INDEX .pdf
> files?
>
> I have played with the valid-extension flag, but when I set it to only .pdf
> it doesn't seem to spider .html anymore...
According to Douglas S. Davis:
> You put a noindex/follow meta tag on each of the HTML pages. I do
> that with my master tables, so that only the pages are indexed and
> not the tables.
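(For reference, the tag Douglas describes is the standard robots meta
tag, placed in the <head> of each page:)

    <meta name="robots" content="noindex,follow">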
If it's too much effort to add meta robots tags to all your HTML pages,
or if you don't want to block ALL robots from those pages, another
approach is to have htdig insert the tag itself at indexing time, via
external_parsers:
    external_parsers: text/html->text/html-internal /path/to/htmlnoindex.sh \
                      application/pdf->text/html-internal /path/to/doc2html.pl
where doc2html.pl is your usual external converter script for PDFs.
(Note the addition of "-internal" to text/html, so that the doc2html.pl
output doesn't get run back through htmlnoindex.sh.) htmlnoindex.sh is
as follows:
    #!/bin/sh
    # Prepend a robots meta tag, then pass the original page through.
    # htdig will follow the page's links but won't index its content.
    echo '<meta name="robots" content="noindex,follow">'
    cat "$1"
This script automatically adds the meta robots tag to the start of all
real text/html files as they're being indexed.
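You can sanity-check the script from the shell before wiring it into
your config (the file name here is just an example):

    $ echo '<a href="foo.pdf">foo</a>' > /tmp/test.html
    $ sh /path/to/htmlnoindex.sh /tmp/test.html
    <meta name="robots" content="noindex,follow">
    <a href="foo.pdf">foo</a>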
As you've figured out, putting only .pdf in valid_extensions prevents
htdig from even looking at .html files for links.
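So if you set valid_extensions at all, keep the HTML extensions in the
list alongside .pdf, and rely on the meta tag trick above for the
no-indexing part; something like this (the exact list is site-specific):

    valid_extensions: .html .htm .pdf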
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)