Still working on processing the old email digests to convert them to
individual emails in mbox format for import.
Meanwhile, though, I thought of a new issue which has to do with
identifying the list name from the email headers, given that the list
name (and consequently the email address in the To: line) varies.
The list is currently identified in the To: line as tang...@mit.edu and
the last 10 years' archives should be consistent (that's the easy part).
Before that, though, it was hosted with Listserv software on a different
host and was identified as tang...@mitvma.mit.edu, which in some very old
digests (ca. 1994) shows up as tango-l%mitvma.bit...@uga.cc.uga.edu (the
uga.edu part could vary depending on which BITNET-to-Internet gateway was
used). In addition, the list was called TANGO for a short while before
it was changed to TANGO-L. In summary, we have the following, which are
all the same list:
tang...@mit.edu
tang...@mitvma.mit.edu
tango-l%mitvma.bit...@uga.cc.uga.edu (with possibly gateways other than
uga.edu)
ta...@mitvma.mit.edu
tango%mitvma.bit...@uga.cc.uga.edu (with possibly gateways other than
uga.edu)
So, with this background, my questions are:
1. Since this is for a manual archive import (as opposed to incoming
email that has to be filtered intelligently), you would have all the
files and could presumably just force them all to go to the same
database. Is this true?
2. Even if you did that, though, would searches work reliably? I would
want the first of the above addresses (tang...@mit.edu) to be the
primary address by which the list would be identified, so for search
queries, people would always identify the list as tang...@mit.edu, as
that's what it's been for the last 10 years. But would ALL the above
variants then be included in the search? I.e., is there an internal tag
created with the archive that identifies it with just one email address
for queries, regardless of what's on the "To:" field of an individual
message? No one is going to be searching for this list with anything
other than "Tango-L" (most likely) or "tang...@mit.edu", notwithstanding
the alternate forms.
3. Any other things to think about regarding this issue? I have resigned
myself to writing scripts for splitting some old digests into
mbox-format individual emails for import, so if there is something else
I need to do on each message to address this new issue, I could
incorporate it into the script. I could just force all the "To:" headers
to tang...@mit.edu, obliterating all the previous forms, which clearly
would solve the problem, but do I need to? (It is extra work: some files
are already in mbox format and I would otherwise not need to touch them,
but I would have to in order to make this change.)
4. If, in the future, the list were to move to a different host and have
an address like tang...@list.tango-l.com (a "permanent" change), how
could I ensure that it continued to go to the same archive? And would
there be a single list tag in a search query that would get posts from
that archive (and only that archive) regardless of whether the posts
were pre-move or post-move? In this hypothetical situation people may
indeed search with the mit.edu domain OR the tango-L.com domain, and
ideally either would call up the same combined archive.
(I did read the FAQ article on "My list splits into multiple archives"
which touches on this issue, but it seems relevant mostly to incoming
mail filtering rather than old archive processing.)
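
For reference, the "force all the To: headers" idea from question 3 could
be sketched roughly as below. This is only a minimal sketch under stated
assumptions, not a description of how mail-archive.com handles imports.
The addresses are hypothetical placeholders (list-l@example.edu as the
primary address, oldhost.example.edu as the old Listserv host, and an
assumed "%oldhost.bitnet@gateway" form for the gateway variants); the
real patterns would have to be filled in by hand. It assumes each mbox
file is not in use by anything else while it is rewritten.

```python
import mailbox
import re
from email.utils import formataddr, getaddresses

# Hypothetical placeholder for the primary list address.
CANONICAL = "list-l@example.edu"

# Assumed shape of the older variants: local part "list" or "list-l",
# either directly at the old host or routed through any gateway.
VARIANT_RE = re.compile(
    r"^list(-l)?(@oldhost\.example\.edu|%oldhost\.bitnet@[^@\s]+)$",
    re.IGNORECASE,
)


def is_variant(addr: str) -> bool:
    """True if addr is one of the old forms of the list address."""
    return VARIANT_RE.match(addr.strip()) is not None


def normalize_to_header(value: str) -> str:
    """Replace old list addresses in a To: value with CANONICAL.

    The value is returned untouched if it contains no old variant, so
    messages that are already fine are not rewritten.
    """
    pairs = getaddresses([value])
    if not any(is_variant(addr) for _, addr in pairs):
        return value
    out = []
    for name, addr in pairs:
        if is_variant(addr) or addr.lower() == CANONICAL:
            entry = CANONICAL
        else:
            entry = formataddr((name, addr))  # keep other recipients
        if entry not in out:  # avoid listing the list address twice
            out.append(entry)
    return ", ".join(out)


def rewrite_mbox(path: str) -> int:
    """Normalize To: headers in one mbox file; return messages changed."""
    box = mailbox.mbox(path, create=False)
    changed = 0
    try:
        for key, msg in box.items():
            old = msg.get("To")
            if old is None:
                continue
            new = normalize_to_header(str(old))
            if new != str(old):
                msg.replace_header("To", new)
                box[key] = msg  # store the modified message back
                changed += 1
        box.flush()  # write changes to disk
    finally:
        box.close()
    return changed
```

Running rewrite_mbox() on a copy of each file first would make it easy to
diff the result against the original before anything is submitted.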
Shahrukh
On 4/15/2015 2:29 AM, Jeff Breidenbach wrote:
Statute of limitations is typically 3 kilomessages on a normal
non-import list, but should (I think) be unlimited on bulk import.
Conversion to unix newlines is required and is manual; doesn't
matter who does it.
Still prefer to do whole import at once especially if tricky; less
labor, also less likely to break URLs if it takes multiple attempts
to get it right. But we can accommodate two stages. Imports are
done on weekends only.
_______________________________________________
Gossip mailing list
https://www.mail-archive.com/gossip@mail-archive.com
https://www.mail-archive.com/cgi-bin/mailman/options/gossip