Hi David,

Thanks for the suggestion. However, as far as I can tell, there are
two problems with the setup that you suggested:

start_url:        http://www.mail-archive.com/[email protected]/maillist.html
limit_urls_to:    .html
exclude_urls:     .txt

The first problem is that by following links, htdig might climb right
out of the subdirectory I want to index and start capturing all the
html files on the server, right? That would be bad because there
are a lot of irrelevant files on the server.
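
To make the concern concrete, here is a rough Python sketch of how I
understand the two attributes to behave: a URL is kept if it contains
any limit_urls_to pattern and dropped if it contains any exclude_urls
pattern, both as plain substring tests. The function name and the
example.com URLs are made up for illustration; this is not htdig code.

```python
def would_index(url, limit_urls_to, exclude_urls):
    """Approximate the URL filter: substring match against both lists."""
    allowed = any(pattern in url for pattern in limit_urls_to)
    excluded = any(pattern in url for pattern in exclude_urls)
    return allowed and not excluded


# The suggested settings.
limit = [".html"]
exclude = [".txt"]

# Hypothetical URLs: one inside the list archive, one elsewhere on the server.
inside = "http://www.example.com/list/msg00001.html"
outside = "http://www.example.com/other/index.html"

print(would_index(inside, limit, exclude))
print(would_index(outside, limit, exclude))
```

If that reading is right, both URLs pass, because ".html" says nothing
about which directory a page lives in.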

The second issue is that I have no idea what file extensions
attachments will have -- they could be anything from .txt to
gobbledygook for all I know. It's the one that I'm not expecting
that I am most afraid of.
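
A small Python sketch of why an extension-based exclude list worries
me. It assumes the same substring-matching behavior for limit_urls_to
and exclude_urls, and it assumes attachments end up under the same
"msg" prefix as the messages. The function name, URLs, and the ".xyz"
extension are invented for illustration.

```python
def would_index(url, limit_urls_to, exclude_urls):
    """Approximate the URL filter: substring match against both lists."""
    allowed = any(pattern in url for pattern in limit_urls_to)
    excluded = any(pattern in url for pattern in exclude_urls)
    return allowed and not excluded


# Restrict to the message prefix, and exclude the one extension we thought of.
limit = ["http://www.example.com/list/msg"]
exclude = [".txt"]

urls = [
    "http://www.example.com/list/msg00001.html",  # a message page
    "http://www.example.com/list/msg00001.txt",   # an anticipated attachment
    "http://www.example.com/list/msg00002.xyz",   # an unanticipated attachment
]

kept = [url for url in urls if would_index(url, limit, exclude)]
for url in kept:
    print(url)
```

The .txt file is filtered out, but the .xyz file is kept along with
the message page, since nothing in the exclude list names it.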

I have appended the non-cosmetic portion of the htdig configuration
file that I use, in case it is helpful. By the way, earlier versions
of MHonArc stored attachments slightly differently, so I've never
had this conundrum before.

Jeff

-----------------------------------------

# HTDIG configuration file.
# Automaticly generated. Do not edit.

start_url:        http://www.mail-archive.com/[email protected]/maillist.html
database_dir:           /home/archive/vault/gossip_jab_org
bad_word_list:          /home/archive/conf/bad_words.txt
nothing_found_file:     /home/archive/conf/nomatch.html
search_results_wrapper: /home/archive/conf/wrapper.html
limit_urls_to:        http://www.mail-archive.com/[email protected]/msg
exclude_urls:         .mhonarc.db .htaccess
max_head_length:      10000
remove_bad_urls:      true
use_star_image:       no
maintainer:           [EMAIL PROTECTED]
search_algorithm:     exact:1
allow_virtual_hosts:  true
allow_numbers:        true
no_next_page_text:
no_prev_page_text:
backlink_factor:      0

local_urls: \
 http://www.mail-archive.com/[email protected][EMAIL PROTECTED]/

