According to Bill Paxton:
> This is probably an old issue but I cannot search the
> archives efficiently on these keywords.
>
> Anyway, when searching a hypermail archive with htdig
> the results are pretty useless. The subject needs to
> be where the page title is (or only a partial subj
> line), the date needs to be the email date not the
> .html file's date, and the first xxx lines need to be
> filtered from displaying and being indexed.
>
> With the messages where the mail file is still around,
> I can have hypermail regenerate the htmls as needed;

If you can regenerate the HTML files as needed, the fixes for these
3 problems are:

1) put the subject of each message between <title> and </title> tags
   in the <head> of each HTML page,
2) put the e-mail date in a <meta name="date"> tag, in ISO-8601 format,
   and use the use_doc_date attribute, and
3) use additional tags to separate out sections not to be indexed
   (see http://www.htdig.org/FAQ.html#q4.15).

However...

> but many thousands are just in html - the mbox is long
> since deleted.
>
> And when you're searching a few hundred thousand
> messages, it's ugly when all of the relevant hits are
> hundreds down and don't "make the cut" because of the
> repeated subject/title line (and weighting depends on
> how much some emails were re-re-re-quoted)

Well, given the large number of messages in HTML only, I see two
options. Both involve writing a script of some sort to make all the
3 above changes to the HTML files. The two options are to run this
script on all the files, replacing them in your collection with the
modified ones, or to make htdig run this script on the fly as it's
indexing your hypermail archive. To do it on the fly, you'd need to
write the script as an external converter, which takes HTML input and
outputs the modified HTML, and call it like so from your htdig.conf:

    external_parsers: text/html->text/html-internal /path/to/my/script

Then, htdig would pass all text/html files through this script before
it parses them itself using its internal HTML parser.

> I searched contrib but couldn't find anything useful.
> I know others must have done something like this
> before, and I beg forgiveness for asking something I
> know is answered... but have you ever tried searching
> on "htdig hypermail archive" ?

Yeah, I see your point. "hypermail" comes up an awful lot, especially
since we used hypermail for several years for our archives, before our
move to SourceForge and its Geocrawler archives.
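
To make the on-the-fly option above a bit more concrete, here is a rough
sketch of what such a converter script might look like in Python. It rests
on several assumptions that you would need to check against your own
setup: that hypermail left <!-- subject="..." --> and
<!-- isoreceived="YYYYMMDD..." --> comments in each page, that htdig passes
the temporary input file as the first command-line argument and reads the
converted page from standard output, and that the default
<!--htdig_noindex--> and <!--/htdig_noindex--> markers are in effect. The
convert() helper and the regular expressions are illustrative only; they
are not part of htdig or hypermail.

```python
#!/usr/bin/env python3
"""Hypothetical htdig external converter: text/html -> text/html-internal."""
import html
import re
import sys

# Assumed hypermail comment markers; verify them against your own pages.
SUBJECT_RE = re.compile(r'<!--\s*subject="(.*?)"\s*-->', re.S)
DATE_RE = re.compile(r'<!--\s*isoreceived="(\d{4})(\d{2})(\d{2})\d*"\s*-->')
TITLE_RE = re.compile(r'<title>.*?</title>', re.I | re.S)
HEAD_RE = re.compile(r'<head(\s[^>]*)?>', re.I)
NEXT_RE = re.compile(r'<!--\s*next="start"\s*-->')
BODY_RE = re.compile(r'<!--\s*body="start"\s*-->')


def convert(page):
    """Return the page with subject title, date meta tag and noindex markers."""
    subject = SUBJECT_RE.search(page)
    if subject:
        # Normalise escaping so the subject is safe inside <title>.
        text = html.escape(html.unescape(subject.group(1)), quote=False)
        title = "<title>" + text + "</title>"
        if TITLE_RE.search(page):
            page = TITLE_RE.sub(lambda m: title, page, count=1)
        else:
            page = HEAD_RE.sub(lambda m: m.group(0) + "\n" + title, page, count=1)

    date = DATE_RE.search(page)
    if date:
        # ISO-8601 date for htdig to pick up when use_doc_date is enabled.
        meta = '<meta name="date" content="%s-%s-%s">' % date.groups()
        page = HEAD_RE.sub(lambda m: m.group(0) + "\n" + meta, page, count=1)

    # Open a noindex section at each navigation block, close it at the body.
    page = NEXT_RE.sub(lambda m: "<!--htdig_noindex-->" + m.group(0), page)
    page = BODY_RE.sub(lambda m: "<!--/htdig_noindex-->" + m.group(0), page)
    return page


def main(argv):
    # Assumption: argv[1] is the temporary file holding the fetched page.
    with open(argv[1], "rb") as handle:
        page = handle.read().decode("latin-1")
    sys.stdout.buffer.write(convert(page).encode("latin-1", "xmlcharrefreplace"))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

If something along these lines works for your pages, the same convert()
function could also be run once over the whole collection (the first
option), rather than on every indexing run. Either way, use_doc_date would
still need to be set in htdig.conf for the date tag to have any effect.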

> Come to think of it, the htdig.org archive results
> page looks pretty messed up, so maybe it's not "fixed"
>
> If I'm using the wrong search package for this, feel
> free to suggest another. Going crazy here.

I'm not sure what you mean by our archive results being messed up.
Certainly, the first problem is solved for our archives, as the message
subject lines do come up as the page titles in searches. I don't think
that had ever been a problem for us, so I was actually surprised when
you mentioned your hypermail archives didn't do this. The modification
dates do reflect message dates, or at least archival dates, for our
older hypermail archives. For the GeoCrawler archives, we haven't yet
found a solution for getting the date set, as I haven't written an
on-the-fly external converter for this task.

As for excluding certain sections of each message page, no, we haven't
done that yet either. This actually seems to be less of a problem for
Geocrawler than it was for hypermail. For hypermail, we could always
reindex with

    noindex_start:  <!-- next="start" -->
    noindex_end:    <!-- body="start" -->

which would eliminate all the Next message: ... stuff from both places
where it appears. The first section would end when the body starts, and
for the second one, htdig wouldn't find the end, so it should ignore
that up to the end of the page.

Are there other problems you've spotted than the two I've mentioned
(i.e. Mod. dates for GeoCrawler, Next message stuff for hypermail)?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

