On Sun, 20 Oct 2019, Geert Uytterhoeven wrote:

> Hi all,
> 
> I'm working to add this list to lore.kernel.org.

That's great news because lore.kernel.org is a search engine that actually 
works.

> As one of prerequisites they require that we provide full existing 
> archives of all list messages (or, at least, as complete as possible). 
> I've collected mine already, but would really appreciate if you could 
> pitch in from your own collection.
> 
> Just follow the instructions on this page:
> https://korg.wiki.kernel.org/userdoc/lore
> 

For anyone else attempting this, note that linux-m68k has two addresses, 
so you need to pass two list IDs to the '-l' parameter:
-l linux-m68k.vger.kernel.org linux-m68k.lists.linux-m68k.org

The above wiki page neglects to mention that the 'list-archive-maker.py' 
script has serious problems.

It can't deal with Alpine mboxes because they don't mangle "From" in 
message bodies as ">From". This leads to truncated messages.
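
To illustrate the failure mode, here is a minimal sketch using Python's 
standard mailbox module as a stand-in parser. Whether 
'list-archive-maker.py' splits messages in exactly this way is an 
assumption on my part; the sample message and the count_messages() helper 
are made up for the demonstration.

```python
import mailbox
import os
import tempfile

# One message whose body contains an unescaped line starting with "From ".
RAW = (
    "From a@example.org Sun Oct 20 12:34:56 2019\n"
    "Subject: one message\n"
    "\n"
    "First paragraph.\n"
    "\n"
    "From what I can tell, this line is body text.\n"
    "Last paragraph.\n"
)


def count_messages(raw):
    """Write raw to a temporary mbox file and count the messages found."""
    fd, path = tempfile.mkstemp(suffix=".mbox")
    try:
        with os.fdopen(fd, "w", newline="\n") as handle:
            handle.write(raw)
        box = mailbox.mbox(path, create=False)
        try:
            return len(box)
        finally:
            box.close()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print("unescaped:", count_messages(RAW))
    print("escaped:  ", count_messages(RAW.replace("\nFrom what", "\n>From what")))
```

With the unescaped body the parser sees two messages, the first of which 
is truncated. With ">From" it sees one.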

I strongly recommend that you enable the '-r' parameter and then examine 
all of the rejected messages.

You'll also need to edit the script so that it does not capture messages 
that were rejected for obvious reasons (wrong list-id), leaving only those 
rejected because of a messed-up message boundary (i.e. a 'From ' 
mistakenly used as a message delimiter).
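
As an alternative to editing the script, the rejects could be sorted 
afterwards. The following is only a sketch of that idea, not something the 
script provides: classify_reject() and its labels are hypothetical, and 
treating a missing Message-Id or Date header as a sign of a bad split is 
my own heuristic.

```python
import email

# Assumption: the two list IDs passed to '-l' above.
LIST_IDS = ("linux-m68k.vger.kernel.org", "linux-m68k.lists.linux-m68k.org")


def classify_reject(raw_message, list_ids=LIST_IDS):
    """Label one rejected message (given as text) by the likely reason."""
    msg = email.message_from_string(raw_message)
    list_id = msg.get("List-Id", "")
    # A List-Id that names some other list is the "obvious" kind of reject.
    if list_id and not any(lid in list_id for lid in list_ids):
        return "wrong-list"
    # A fragment produced by a false 'From ' delimiter starts in the middle
    # of a body, so it normally lacks the usual headers.
    if not msg.get("Message-Id") or not msg.get("Date"):
        return "suspect-boundary"
    return "other"
```

Only the "suspect-boundary" messages would then need to be examined by 
hand.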

Another problem with that script is that it captures too much. It will 
grab messages that appear to be cross-posted (based on To: or Cc:) even if 
those messages never reached linux-m68k. I suppose the idea is that 
capturing too much is better than too little?

The script fabricates a missing List-ID header based on a guess. I don't 
know why it does this (bad idea from an archival perspective).

> I uploaded the list of message-ids that I already have to
> http://users.telenet.be/geertu/linux-m68k-message-ids.tar.xz
> You'll need it during the archive sanitization process to pass to the -k 
> switch.
> 
> Please tar up and xz -9 the resulting directory with mbox files and send
> the archive to me so I can add it to what I already have.
> 
> The archives I used, from my personal email collection, are:
>   1. [email protected] 680x0 channel digest (May 1993 - March 
> 1995)
>      Used initially.  Probably there was never a non-digest version?
>   2. [email protected] (Dec 1994 - Dec 1995)
>      First real mailing list.  Abandoned due to latency (most developers were
>      located in Europe and 2 Mbps transatlantic sucked).
>   3. [email protected] (Oct 1995 - Oct 2004)
>      Second mailing list. Abandoned due to spam and lack of admin activity.
>      I did my best to remove spam.
>   4. [email protected] (Oct 2004 - Current)
>      Current mailing list.
> As this is a single logical mailing list, the plan is to combine all of 
> it in a single archive.
> 
> My  archive should be fairly complete, except for network outages, and e.g.
> the Gandi email disaster week 2 years ago.  And I don't have anything from
> the real early days, unfortunately.
> 

I'll let you know if I find any missing messages here.

> Note that sanitization script choked on some mails from the old 
> phil.uni-sb.de list, so it didn't succeed for me.
> 

Was that the "From" bug? I am experimenting with pre-processing of mboxes 
to substitute the "From" lines in the message bodies. Not yet sure if this 
will be entirely successful...
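
A minimal sketch of that kind of pre-processing follows. It is a 
heuristic, not a tested fix: it assumes that a real separator looks like 
"From <sender> <ctime-style date>", follows a blank line (or starts the 
file) and is followed by a header line. Everything else starting with 
"From " is treated as body text. The function name and both patterns are 
my own choices.

```python
import re

# Assumed shape of a real separator line, e.g.
# "From geert@example.org Sun Oct 20 12:34:56 2019"
SEPARATOR = re.compile(
    r"^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ \d]\d "
    r"\d\d:\d\d:\d\d .*\d{4}\s*$"
)
# A header field name (printable ASCII without ':') followed by a colon.
HEADER = re.compile(r"^[!-9;-~]+:")


def escape_body_from_lines(lines):
    """Return a copy of lines with body 'From ' lines rewritten as '>From '."""
    out = []
    for i, line in enumerate(lines):
        if line.startswith("From "):
            prev_blank = i == 0 or lines[i - 1].strip() == ""
            next_is_header = i + 1 < len(lines) and HEADER.match(lines[i + 1])
            is_separator = SEPARATOR.match(line) and prev_blank and next_is_header
            if not is_separator:
                line = ">" + line
        out.append(line)
    return out
```

A quoted separator line inside a body that happens to follow a blank line 
and precede something header-like would still be mistaken for a delimiter, 
so the output needs checking.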

-- 

> Thanks!
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> 
