According to Bill Paxton:
> This is probably an old issue but I cannot search the
> archives efficiently on these keywords.
> 
> Anyway, when searching a hypermail archive with htdig
> the results are pretty useless. The subject needs to
> be where the page title is (or only a partial subj
> line), the date needs to be the email date not the
> .html file's date, and the first xxx lines need to be
> filtered from displaying and being indexed.
> 
> With the messages where the mail file is still around,
> I can have hypermail regenerate the htmls as needed;

If you can regenerate the HTML files as needed, the fixes for these 3
problems are:  1) put the subject of each message between <title> and
</title> tags in the <head> of each HTML page, 2) put the e-mail date
in a <meta name="date"> tag, in ISO-8601 format, and use the use_doc_date
attribute, and 3) use additional tags to separate out sections not to
be indexed (see http://www.htdig.org/FAQ.html#q4.15).  However...

> but many thousands are just in html - the mbox is long
> since deleted.
> 
> And when you're searching a few hundred thousand
> messages, it's ugly when all of the relevant hits are
> hundreds down and don't "make the cut" because of the
> repeated subject/title line (and weighting depends on
> how much some emails were re-re-re-quoted)

Well, given the large number of messages in HTML only, I see two options.
Both involve writing a script of some sort to make all three of the changes
above to the HTML files.  The first option is to run this script once on all
the files, replacing them in your collection with the modified ones.  The
second is to make htdig run the script on the fly as it indexes your
hypermail archive.  To do it on the fly, you'd need to write the script as
an external converter, which takes HTML input and outputs the modified HTML,
and call it like so from your htdig.conf:

external_parsers:       text/html->text/html-internal /path/to/my/script

Then, htdig would pass all text/html files through this script before it
parses them itself using its internal HTML parser.
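
As a rough sketch of what such a converter could look like, here is a
minimal Python version.  It rests on assumptions you'd need to check
against your own setup: that your hypermail pages carry
<!-- subject="..." --> and <!-- sent="..." --> comments, that the sent date
is in the usual mail Date: header format, and that htdig passes the path of
the document as the first command-line argument and reads the converted
HTML from standard output.  The function names are just for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical htdig external converter for hypermail pages (a sketch)."""
import html
import re
import sys
from email.utils import parsedate_to_datetime

# Assumed hypermail comment markers; verify them against your own pages.
SUBJECT_RE = re.compile(r'<!--\s*subject="(.*?)"\s*-->', re.S)
SENT_RE = re.compile(r'<!--\s*sent="(.*?)"\s*-->', re.S)
TITLE_RE = re.compile(r'<title>.*?</title>', re.I | re.S)
HEAD_END_RE = re.compile(r'</head>', re.I)


def iso_date(raw):
    """Return YYYY-MM-DD for a mail-style date, or None if unparseable."""
    try:
        return parsedate_to_datetime(raw).strftime('%Y-%m-%d')
    except (TypeError, ValueError):
        return None


def convert(page):
    """Put the subject in <title> and the sent date in <meta name="date">."""
    subject = SUBJECT_RE.search(page)
    sent = SENT_RE.search(page)
    if subject:
        # Normalize entities so the title is escaped exactly once.
        text = html.escape(html.unescape(subject.group(1).strip()), quote=False)
        title = '<title>%s</title>' % text
        if TITLE_RE.search(page):
            page = TITLE_RE.sub(lambda m: title, page, count=1)
        else:
            page = HEAD_END_RE.sub(lambda m: title + '\n</head>', page, count=1)
    date = iso_date(sent.group(1)) if sent else None
    if date:
        meta = '<meta name="date" content="%s">' % date
        page = HEAD_END_RE.sub(lambda m: meta + '\n</head>', page, count=1)
    return page


def main(argv):
    # Assumption: argv[1] is the path of the document handed over by htdig.
    with open(argv[1], encoding='latin-1') as f:
        page = f.read()
    out = convert(page).encode('latin-1', errors='xmlcharrefreplace')
    sys.stdout.buffer.write(out)


if __name__ == '__main__' and len(sys.argv) > 1:
    main(sys.argv)
```

Pages without the expected comments pass through unchanged, so the same
script could also be run once over the whole collection for the first
option.  Remember to set use_doc_date so htdig picks up the meta date.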

> I searched contrib but couldn't find anything useful.
> I know others must have done something like this
> before, and I beg forgiveness for asking something I
> know is answered... but have you ever tried searching
> on "htdig hypermail archive" ?

Yeah, I see your point.  "hypermail" comes up an awful lot, especially
since we used hypermail for several years for our archives, before our
move to SourceForge and its Geocrawler archives.

> Come to think of it, the htdig.org archive results
> page looks pretty messed up, so maybe it's not "fixed"
> 
> If I'm using the wrong search package for this, feel
> free to suggest another. Going crazy here.

I'm not sure what you mean by our archive results being messed up.
Certainly, the first problem is solved for our archives, as the message
subject lines do come up as the page titles in searches.  I don't think
that had ever been a problem for us, so I was actually surprised when
you mentioned your hypermail archives didn't do this.

The modification dates do reflect message dates, or at least archival
dates, for our older hypermail archives.  For the Geocrawler archives,
we haven't yet found a solution for getting the date set, as I haven't
written an on-the-fly external converter for this task.  As for excluding
certain sections of each message page, no, we haven't done that yet
either.  This actually seems to be less of a problem for Geocrawler than
it was for hypermail.  For hypermail, we could always reindex with

noindex_start:  <!-- next="start" -->
noindex_end:    <!-- body="start" -->

which would eliminate all the "Next message: ..." navigation links from both
places where they appear.  The first excluded section would end where the
message body starts.  For the second one, htdig wouldn't find an end marker,
so it should ignore everything from there to the end of the page.
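
To make that behaviour concrete, here is a small Python model of it.  This
is only a simulation of the behaviour described above, not htdig's actual
parser, and the function name strip_noindex is made up for the example.

```python
def strip_noindex(page,
                  start='<!-- next="start" -->',
                  end='<!-- body="start" -->'):
    """Return the text that would remain once the noindex sections are dropped."""
    kept = []
    pos = 0
    while True:
        s = page.find(start, pos)
        if s == -1:
            # No further start marker: keep the rest of the page.
            kept.append(page[pos:])
            break
        kept.append(page[pos:s])
        e = page.find(end, s + len(start))
        if e == -1:
            # No matching end marker: ignore everything to the end of the page.
            break
        pos = e + len(end)
    return ''.join(kept)


page = ('HEAD <!-- next="start" -->Next message: foo'
        '<!-- body="start" -->BODY TEXT'
        '<!-- next="start" -->Next message: foo again')
print(strip_noindex(page))  # prints "HEAD BODY TEXT"
```

Note that under this model anything following the second navigation block
is dropped as well, which is fine as long as nothing worth indexing comes
after it on your pages.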

Have you spotted any problems other than the two I've mentioned
(i.e. modification dates for Geocrawler, "Next message" links for hypermail)?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

