Still working on processing the old email digests to convert them to individual emails in mbox format for import.

Meanwhile, though, I thought of a new issue which has to do with identifying the list name from the email headers, given that the list name (and consequently the email address in the To: line) varies.

The list is currently identified in the To: line as tang...@mit.edu and the last 10 years' archives should be consistent (that's the easy part).

Before that, though, it was hosted with Listserv software on a different host and was identified as tang...@mitvma.mit.edu, which in some very old digests (ca. 1994) shows up as tango-l%mitvma.bit...@uga.cc.uga.edu (the uga.edu part could vary depending on which BITNET-to-Internet gateway was used). In addition, the list was called TANGO for a short while before it was changed to TANGO-L. In summary, we have the following, which are all the same list:

tang...@mit.edu
tang...@mitvma.mit.edu
tango-l%mitvma.bit...@uga.cc.uga.edu (possibly with gateways other than uga.edu)
ta...@mitvma.mit.edu
tango%mitvma.bit...@uga.cc.uga.edu (possibly with gateways other than uga.edu)
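
To make the "all the same list" rule concrete, here is a minimal sketch in Python of how a script might recognize these variants. The function name `is_list_address` and its parameters are hypothetical, and the names and hosts in the usage lines are placeholders rather than the real addresses. It assumes a bare address (no display name) and treats the gatewayed form as name%node.bitnet@gateway, ignoring the gateway host since that part can vary.

```python
def is_list_address(addr, names, hosts, bitnet_nodes):
    """Return True if addr is one of the known forms of the list.

    names        -- lowercase local parts the list has used
    hosts        -- lowercase Internet hosts the list has lived on
    bitnet_nodes -- lowercase BITNET node names seen in gatewayed addresses
    """
    addr = addr.strip().lower()
    local, _, domain = addr.rpartition("@")
    if not local:
        return False  # no "@" at all, so not an address we can classify
    if "%" in local:
        # Gatewayed form: name%node.bitnet@gateway. The gateway host is
        # deliberately not checked because it depends on the route taken.
        name, _, node = local.partition("%")
        return name in names and node.split(".")[0] in bitnet_nodes
    return local in names and domain in hosts


# Placeholder values for illustration only.
NAMES = {"mylist-l", "mylist"}
HOSTS = {"example.edu", "vm.example.edu"}
NODES = {"vm"}

print(is_list_address("MYLIST-L@example.edu", NAMES, HOSTS, NODES))
print(is_list_address("mylist%vm.bitnet@gw.example.net", NAMES, HOSTS, NODES))
print(is_list_address("other@example.edu", NAMES, HOSTS, NODES))
```

Passing the names, hosts and nodes in as sets keeps the five variants above in one place, so a later host change only means adding one entry.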

So, with this background, my questions are:

1. Since this is for a manual archive import (as opposed to incoming email that has to be filtered intelligently), you would have all the files and could presumably just force it to go to the same database. Is this true?

2. Even if you did that, though, would searches work reliably? I would want the first of the above addresses (tang...@mit.edu) to be the primary address by which the list is identified, so for search queries, people would always identify the list as tang...@mit.edu, as that's what it's been for the last 10 years. But would ALL the above variants then be included in the search? I.e., is there an internal tag created with the archive that identifies it with just one email address for queries, regardless of what's in the "To:" field of an individual message? No one is going to be searching for this list with anything other than "Tango-L" (most likely) or "tang...@mit.edu", notwithstanding the alternate forms.

3. Any other things to think about regarding this issue? I have resigned myself to writing scripts for splitting some old digests into mbox-format individual emails for import, so if there is something else I need to do to each message to address this new issue, I could incorporate it into the script. I could just force all the "To:" headers to tang...@mit.edu, obliterating all the previous forms, which clearly would solve the problem, but do I need to? (It is extra work, and some archives are already in mbox format that I would otherwise not need to touch, but would have to in order to make this change.)

4. If, in the future, the list were to move to a different host and have an address like tang...@list.tango-l.com (a "permanent" change), how could I ensure that it continued to go to the same archive? And would there be a single list tag in a search query that would get posts from that archive (and only that archive) regardless of whether the posts were pre-move or post-move? In this hypothetical situation people may indeed search with the mit.edu domain OR the tango-L.com domain, and ideally either would call up the same combined archive.
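
For reference on question 3, the "force all the To: headers" option could be sketched roughly as below, using Python's standard `mailbox` module. This is only an illustration under assumptions: `retarget_mbox` and the `X-Original-To` header are hypothetical names, the addresses in the usage are placeholders, and whether the archive keys off the To: header at all is exactly what question 2 asks. The sketch keeps the old address in a separate header instead of discarding it, and rewrites the file in place, so it should be run on a copy.

```python
import mailbox
from email.utils import getaddresses


def retarget_mbox(path, canonical, is_alias):
    """Point To: at `canonical` on every message addressed to an alias.

    is_alias is a caller-supplied function taking a bare address and
    returning True for any known form of the list. Returns the number
    of messages changed.
    """
    box = mailbox.mbox(path)
    changed = 0
    try:
        for key, msg in box.items():  # items() returns a list, safe to modify
            addrs = [a for _, a in getaddresses(msg.get_all("To", []))]
            if any(is_alias(a) for a in addrs) and addrs != [canonical]:
                # Hypothetical header: preserve the historical address.
                msg["X-Original-To"] = ", ".join(addrs)
                del msg["To"]
                msg["To"] = canonical
                box[key] = msg
                changed += 1
        box.flush()  # write the modified messages back to the file
    finally:
        box.close()
    return changed


# Example call with placeholder addresses (commented out: edits the file).
# old_forms = {"mylist@vm.example.edu", "mylist-l@vm.example.edu"}
# retarget_mbox("archive-copy.mbox", "mylist-l@example.edu",
#               lambda a: a.lower() in old_forms)
```

Files that already carry only the primary address are left untouched, so running this over archives that are already in mbox format costs nothing beyond the scan.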

(I did read the FAQ article on "My list splits into multiple archives" which touches on this issue, but it seems relevant mostly to incoming mail filtering rather than old archive processing.)

Shahrukh

On 4/15/2015 2:29 AM, Jeff Breidenbach wrote:
Statute of limitations is typically 3 kilomessages on a normal
non-import list, but should (I think) be unlimited on bulk import.
Conversion to unix newlines is required and is manual; doesn't
matter who does it.

Still prefer to do whole import at once especially if tricky; less
labor, also less likely to break URLs if it takes multiple attempts
to get it right.  But we can accommodate two stages. Imports are
done on weekends only.








_______________________________________________
Gossip mailing list
https://www.mail-archive.com/gossip@mail-archive.com
https://www.mail-archive.com/cgi-bin/mailman/options/gossip
