Re: [Gossip] Porting digested new list archives to mail-archive

Shahrukh Merchant Fri, 17 Apr 2015 07:30:28 -0700

Still working on processing the old email digests to convert them to individual emails in mbox format for import.

Meanwhile, though, I thought of a new issue which has to do with identifying the list name from the email headers, given that the list name (and consequently the email address in the To: line) varies.

The list is currently identified in the To: line as tang...@mit.edu and the last 10 years' archives should be consistent (that's the easy part).

Before that, though, it was hosted with Listserv software on a different host and was identified as tang...@mitvma.mit.edu, which in some very old digests ca. 1994 shows up as tango-l%mitvma.bit...@uga.cc.uga.edu (the uga.edu part could vary depending on what BITNET-to-Internet gateway was used). In addition, the list was called TANGO for a short while before it was changed to TANGO-L. In summary, we have the following, which are all the same list:


tang...@mit.edu
tang...@mitvma.mit.edu

tango-l%mitvma.bit...@uga.cc.uga.edu (with possibly gateways other than uga.edu)

ta...@mitvma.mit.edu

tango%mitvma.bit...@uga.cc.uga.edu (with possibly gateways other than uga.edu)


So, with this background, my questions are:

1. Since this is for a manual archive import (as opposed to incoming email that has to be filtered intelligently), you would have all the files and could presumably just force it to go to the same database. Is this true?

2. Even if you did that, though, would searches work reliably? I would want the first of the above addresses (tang...@mit.edu) to be the primary address by which the list would be identified, so for search queries, people would always identify the list as tang...@mit.edu, as that's what it's been for the last 10 years. But would ALL the above variants then be included in the search? I.e., is there an internal tag created with the archive that identifies it with just one email address for queries, regardless of what's on the "To:" field of an individual message? No one is going to be searching for this list with anything other than "Tango-L" (most likely) or "tang...@mit.edu", notwithstanding the alternate forms.

3. Any other things to think about regarding this issue? I have resigned myself to writing scripts for splitting some old digests into mbox-format individual emails for import, so if there is something else I need to do on each message to address this new issue, I could incorporate it into the script. I could just force all the "To:" headers to tang...@mit.edu, obliterating all the previous forms, which clearly would solve the problem, but do I need to? (It is extra work, and some archives are already in mbox format that I would otherwise not need to touch, but would have to in order to make this change.)

4. If, in the future, the list were to move to a different host and have an address like tang...@list.tango-l.com (a "permanent" change), how could I ensure that it continued to go to the same archive? And would there be a single list tag in a search query that would get posts from that archive (and only that archive) regardless of whether the posts were pre-move or post-move? In this hypothetical situation, people may indeed search with the mit.edu domain OR the tango-L.com domain, and ideally either would call up the same combined archive.

(I did read the FAQ article on "My list splits into multiple archives" which touches on this issue, but it seems relevant mostly to incoming mail filtering rather than old archive processing.)
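For illustration only, the "force all the To: headers" option from question 3 could be sketched in Python roughly as below. The real list addresses are elided in this archive, so CANONICAL and VARIANT_RE use hypothetical example.org placeholders, and the function names normalize_to and normalize_mbox are made up for this sketch. Whether mail-archive actually needs this rewrite is exactly the open question, so treat it as one possible approach rather than a required step.

```python
import mailbox
import re
from email.utils import getaddresses

# Hypothetical placeholders: substitute the real canonical address and
# the real historical variants (old host, BITNET gateway forms, old name).
CANONICAL = "list-l@example.org"
VARIANT_RE = re.compile(
    r"^(list|list-l)(@oldhost\.example\.org|%oldhost\.bitnet@[\w.-]+)$",
    re.IGNORECASE,
)


def normalize_to(msg, canonical=CANONICAL, variant_re=VARIANT_RE):
    """Replace the To: header with the canonical address when any address
    in it matches a known variant. Returns True if the header was changed.
    Note: this overwrites the whole To: header, dropping other recipients."""
    addrs = [addr for _name, addr in getaddresses(msg.get_all("To", []))]
    if any(variant_re.match(a) for a in addrs):
        del msg["To"]
        msg["To"] = canonical
        return True
    return False


def normalize_mbox(src_path, dst_path):
    """Copy every message from src_path to dst_path (appending if dst_path
    already exists), normalizing To: on the way. Returns the number of
    messages whose header was rewritten."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    changed = 0
    try:
        for msg in src:
            if normalize_to(msg):
                changed += 1
            dst.add(msg)
        dst.flush()
    finally:
        src.close()
        dst.close()
    return changed
```

Writing to a separate output file leaves the original mbox untouched, so the rewrite can be rerun if the variant pattern turns out to miss a gateway form.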


Shahrukh

On 4/15/2015 2:29 AM, Jeff Breidenbach wrote:

Statute of limitations is typically 3 kilomessages on a normal
non-import list, but should (I think) be unlimited on bulk import.
Conversion to unix newlines is required and is manual; doesn't
matter who does it.

Still prefer to do whole import at once especially if tricky; less
labor, also less likely to break URLs if it takes multiple attempts
to get it right.  But we can accommodate two stages. Imports are
done on weekends only.
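The Unix-newline conversion mentioned above can be done with any tool; as one possible sketch (not mail-archive's own procedure), a small Python helper could look like this. The names to_unix_newlines and convert_file are hypothetical.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF (DOS) and lone CR (old Mac) line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as raw bytes and write an LF-only copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_newlines(data))
```

Working on bytes rather than text avoids guessing the character encoding of 1990s-era messages.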



_______________________________________________
Gossip mailing list
https://www.mail-archive.com/gossip@mail-archive.com
https://www.mail-archive.com/cgi-bin/mailman/options/gossip
