Re: [htdig] hypermail parsing, revisited

Gilles Detillieux Mon, 23 Sep 2002 14:30:06 -0700

Well, this discussion and my frustration with the htdig mail archive index
prompted me to do something about it.  I decided to pursue my idea of an
external converter that fixes up the archived messages on the fly, and it
turns out to be a very flexible way of dealing with matters.  The script
can handle separate sets of changes for the old hypermail archives and the
newer Geocrawler archives, and it can overcome the limitation of having
only one set of strings in noindex_start and noindex_end.  Using sed
in the script, you can do all sorts of edits and deletions on the fly.
One of these edits is to get rid of the "link farms" from the index:
lists of messages are followed for links, but not indexed.
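
For anyone who would rather not do this in sed, here is a rough sketch of
the same idea in Python. It is not the ungeoify.sh script itself, just an
illustration of a converter that strips several marked sections, which a
single noindex_start/noindex_end pair cannot do. The first marker pair is
the hypermail one quoted further down in this message; the second pair and
the names MARKER_PAIRS and strip_sections are made up for the example. I am
also assuming htdig hands the converter the path of a temporary copy of the
page as its first argument and reads the result from standard output; check
the external_parsers documentation for your version.

```python
import sys

# Marker pairs whose enclosed text should be dropped before indexing.
# The first pair is the hypermail one quoted later in this message;
# the second pair is hypothetical and only shows that several can coexist.
MARKER_PAIRS = [
    ('<!-- next="start" -->', '<!-- body="start" -->'),
    ('<!-- nav-begin -->', '<!-- nav-end -->'),  # hypothetical markers
]


def strip_sections(html, pairs=MARKER_PAIRS):
    """Remove each region from a start marker through its end marker.

    If a start marker has no end marker after it, everything from the
    start marker to the end of the page is dropped.
    """
    for start, end in pairs:
        kept = []
        pos = 0
        while True:
            s = html.find(start, pos)
            if s == -1:
                kept.append(html[pos:])  # no more start markers
                break
            kept.append(html[pos:s])
            e = html.find(end, s + len(start))
            if e == -1:
                break  # no end marker: drop the rest of the page
            pos = e + len(end)
        html = "".join(kept)
    return html


def main(argv):
    # Assumption: argv[1] is the path of the page to convert.
    with open(argv[1], "rb") as f:
        text = f.read().decode("latin-1")  # latin-1 round-trips any bytes
    sys.stdout.buffer.write(strip_sections(text).encode("latin-1"))
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Treating a missing end marker as "drop to the end of the page" mirrors what
htdig itself does with noindex_start, as described in the quoted message
below.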


I also developed an archive update script which does daily updates of
the index, reindexing only the last month's messages, so it's very quick
(much faster than a full update dig which would have to check all URLs
for updates).
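
To give an idea of the "only the last month" part, here is a small Python
sketch that works out which monthly archive pages a daily update would
start from. It is not how geoupdate.sh is written; the URL layout in
ARCHIVE_URL and the function names are hypothetical, and a real archive
will use a different scheme.

```python
from datetime import date

# Hypothetical URL layout for per-month archive index pages.
ARCHIVE_URL = "http://archive.example.org/htdig-general/{year}/{month:02d}/"


def recent_months(today, count=2):
    """Return (year, month) pairs for the current month and the
    count - 1 months before it, newest first."""
    year, month = today.year, today.month
    result = []
    for _ in range(count):
        result.append((year, month))
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return result


def start_urls(today, count=2):
    """Build the list of start URLs for a limited update dig."""
    return [ARCHIVE_URL.format(year=y, month=m)
            for y, m in recent_months(today, count)]
```

Covering the current and previous month means messages from the last few
days of a month are still picked up when the update runs early in the next
one.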

These scripts have really helped to clean up the index of our mailing
lists.  You may notice that searches on www.htdig.org now produce less
clutter, and the modification dates are appropriate for when the messages
were received.

Realizing this would be a useful thing for other users, as requests
for solutions to both problems have come up more than once on the
mailing lists, I decided to make the two scripts available in
the contrib section of the htdig.org web site.  See the files
README.geoupdate-ungeoify, geoupdate.sh and ungeoify.sh in
http://www.htdig.org/files/contrib/scripts/

Hope this can help you and/or others.

On Sep 6, Bill Paxton wrote:
> I sort of figured. I think it will be quicker for me
> to write a script to generate a reconstructed mbox
> from the htmls; then, once it's finally done I can use
> hypermail to regenerate the htmls at will with any
> info needed like the META fields. It's an either/or
> situation, and it's going to be messy for a while. I'm
> gonna have to fake-generate msgids and make sure they
> don't... ugh, work work.
> 
> re: messed up search results, ah yes, I was referring
> to the info that fills up the allocated excerpt space
> being useless to the message content, stuff like:
> 
> Geo: "htdig-general - Archive  2002 (1911 msgs)  2001
> (3617 msgs)  1998 (1 msgs) Thread: [htdig]"
> 
> Hyp: "2000 - 08:02:30 PST Archived on: Sun Dec 31 2000
> - 00:02:32 PST 64 messages sorted by: [ author ] [
> date ] [ subject ] This archive was generated by
> hypermail 2b28"
> 
> I know that's going to require a bit more parsing for
> a fairly convoluted string-stripping script to do it
> reliably; which is why I thought it might exist
> already. (I've posted a similar post to the hypermail
> list just in case)
> 
> Perfect would be subject as the link, trimmed of
> majordomo addition, followed by author, date, and then
> the first lines of the email itself (in long)
> 
> Ah well, sometimes there is no brilliant shortcut -
> just hard work. A basic shell script and a box in the
> corner chugging away for days. :-)
> 
> I want to be able to highlight the search words in the
> returned pages as well. Learning search_rewrite_rules
> and writing/finding the script to power the
> highlighting can be something to work on while the
> thing rebuilds the mbox.
> 
> Thanks again. Much to do.
> 
> --------
> 
> --- Gilles Detillieux <[EMAIL PROTECTED]> wrote:
> > According to Bill Paxton:
> > > This is probably an old issue but I cannot search the
> > > archives efficiently on these keywords.
> > > 
> > > Anyway, when searching a hypermail archive with htdig
> > > the results are pretty useless. The subject needs to
> > > be where the page title is (or only a partial subj
> > > line), the date needs to be the email date not the
> > > .html file's date, and the first xxx lines need to be
> > > filtered from displaying and being indexed.
> > > 
> > > With the messages where the mail file is still around,
> > > I can have hypermail regenerate the htmls as needed;
> > 
> > If you can regenerate the HTML files as needed, the fixes for these 3
> > problems are:  1) put the subject of each message between <title> and
> > </title> tags in the <head> of each HTML page, 2) put the e-mail date
> > in a <meta name="date"> tag, in ISO-8601 format, and use the use_doc_date
> > attribute, and 3) use additional tags to separate out sections not to
> > be indexed (see http://www.htdig.org/FAQ.html#q4.15).  However...
> > 
> > > but many thousands are just in html - the mbox is long
> > > since deleted.
> > > 
> > > And when you're searching a few hundred thousand
> > > messages, it's ugly when all of the relevant hits are
> > > hundreds down and don't "make the cut" because of the
> > > repeated subject/title line (and weighting depends on
> > > how much some emails were re-re-re-quoted)
> > 
> > Well, given the large number of messages in HTML only, I see two options.
> > Both involve writing a script of some sort to make all the 3 above changes
> > to the HTML files.  The two options are to run this script on all the files,
> > replacing them in your collection with the modified ones, or to make htdig
> > run this script on the fly as it's indexing your hypermail archive.  To do
> > it on the fly, you'd need to write the script as an external converter,
> > which takes HTML input and outputs the modified HTML, and call it like so
> > from your htdig.conf:
> > 
> > external_parsers:   text/html->text/html-internal /path/to/my/script
> > 
> > Then, htdig would pass all text/html files through this script before it
> > parses them itself using its internal HTML parser.
> > 
> > > I searched contrib but couldn't find anything useful.
> > > I know others must have done something like this
> > > before, and I beg forgiveness for asking something I
> > > know is answered... but have you ever tried searching
> > > on "htdig hypermail archive" ?
> > 
> > Yeah, I see your point.  "hypermail" comes up an awful lot, especially
> > since we used hypermail for several years for our archives, before our
> > move to SourceForge and its Geocrawler archives.
> > 
> > > Come to think of it, the htdig.org archive results
> > > page looks pretty messed up, so maybe it's not "fixed"
> > > 
> > > If I'm using the wrong search package for this, feel
> > > free to suggest another. Going crazy here.
> > 
> > I'm not sure what you mean by our archive results being messed up.
> > Certainly, the first problem is solved for our archives, as the message
> > subject lines do come up as the page titles in searches.  I don't think
> > that had ever been a problem for us, so I was actually surprised when
> > you mentioned your hypermail archives didn't do this.
> > 
> > The modification dates do reflect message dates, or at least archival
> > dates, for our older hypermail archives.  For the Geocrawler archives,
> > we haven't yet found a solution for getting the date set, as I haven't
> > written an on-the-fly external converter for this task.  As for excluding
> > certain sections of each message page, no, we haven't done that yet
> > either.  This actually seems to be less of a problem for Geocrawler than
> > it was for hypermail.  For hypermail, we could always reindex with
> > 
> > noindex_start:      <!-- next="start" -->
> > noindex_end:        <!-- body="start" -->
> > 
> > which would eliminate all the Next message: ... stuff from both places
> > where it appears.  The first section would end when the body starts, and
> > for the second one, htdig wouldn't find the end, so it should ignore that
> > up to the end of the page.
> > 
> > Are there other problems you've spotted than the two I've mentioned
> > (i.e. Mod. dates for Geocrawler, Next message stuff for hypermail)?


-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


