Hi David,
Thanks for the suggestion. However, as far as I can tell, there are
two problems with the setup that you suggested:
start_url: http://www.mail-archive.com/[email protected]/maillist.html
limit_urls_to: .html
exclude_urls: .txt
The first problem is that by following links, htdig might climb right
out of the subdirectory I want to index and start capturing all the
html files on the server, right? That would be bad because there
are a lot of irrelevant files on the server.
The second issue is that I have no idea what file extensions
attachments will have -- they could be anything from .txt to
gobbledygook for all I know. It's the one that I'm not expecting
that I am most afraid of.
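
To make the two concerns concrete, here is a rough Python sketch of how
I understand the filtering to behave. It assumes htdig treats each
limit_urls_to and exclude_urls entry as a plain substring of the URL,
and it ignores case handling and every other htdig attribute. The
function name and the "example-list" and "other-list" URLs are made up
for illustration; they are not the real archive paths.

```python
def htdig_would_follow(url, limit_urls_to, exclude_urls):
    """Approximate the URL filter, assuming plain substring matching.

    A URL is followed if it contains at least one limit_urls_to pattern
    and none of the exclude_urls patterns. This is a simplification,
    not htdig's actual code.
    """
    in_limits = any(pattern in url for pattern in limit_urls_to)
    excluded = any(pattern in url for pattern in exclude_urls)
    return in_limits and not excluded


# Hypothetical URLs standing in for the real archive.
base = "http://www.mail-archive.com/example-list/"
inside = base + "msg00001.html"
outside = "http://www.mail-archive.com/other-list/index.html"
odd_attachment = base + "bin00001.bin"
txt_attachment = base + "notes.txt"

# Problem 1: limiting on ".html" alone lets the crawl leave the directory.
leaks_out = htdig_would_follow(outside, [".html"], [".txt"])

# Limiting on a directory prefix keeps the crawl inside the archive.
stays_in = htdig_would_follow(outside, [base + "msg"], [".txt"])

# Problem 2: an exclude list only blocks the extensions it names.
odd_passes = htdig_would_follow(odd_attachment, [base], [".txt"])
txt_passes = htdig_would_follow(txt_attachment, [base], [".txt"])

print(leaks_out, stays_in, odd_passes, txt_passes)
```

If the substring assumption holds, the first check shows a page in
another directory being accepted, and the third shows an attachment
with an unanticipated extension slipping past an exclude list that only
names .txt.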
I have appended the non-cosmetic portion of the htdig configuration
file that I use, in case it is helpful. By the way, earlier versions
of MHonArc stored attachments slightly differently, so I've never
had this conundrum before.
Jeff
-----------------------------------------
# HTDIG configuration file.
# Automaticly generated. Do not edit.
start_url: http://www.mail-archive.com/[email protected]/maillist.html
database_dir: /home/archive/vault/gossip_jab_org
bad_word_list: /home/archive/conf/bad_words.txt
nothing_found_file: /home/archive/conf/nomatch.html
search_results_wrapper: /home/archive/conf/wrapper.html
limit_urls_to: http://www.mail-archive.com/[email protected]/msg
exclude_urls: .mhonarc.db .htaccess
max_head_length: 10000
remove_bad_urls: true
use_star_image: no
maintainer: [EMAIL PROTECTED]
search_algorithm: exact:1
allow_virtual_hosts: true
allow_numbers: true
no_next_page_text:
no_prev_page_text:
backlink_factor: 0
local_urls: \
http://www.mail-archive.com/[email protected][EMAIL PROTECTED]/
------------------------------------