On 12/18/2010 5:45 AM, Lukáš Vlček wrote:
>
> On Sat, Dec 18, 2010 at 4:31 AM, Mark Sapiro <m...@msapiro.net
> <mailto:m...@msapiro.net>> wrote:
>
>     find /path/to/archives/private/LISTNAME \
>      | egrep "[0-9]{6}.html" \
>      | sed "s;.*archives/private;http://www.example.com/pipermail;"
>
>     with the obvious modification will get the URLs. Will that be enough?
>
>
> Not exactly. I need to index mail list content by external search server
> and for each indexed mail I need to know working mailman public URL of
> that mail.

The above shell command will get you a list of the URLs. If you are
saying you need to know the message content together with the URL, you
could still do this easily from the existing pipermail archive. The
point is that each individual message in the archive is in a file of the
form archives/private/LISTNAME/yyyy-Month/nnnnnn.html, and the
LISTNAME/yyyy-Month/nnnnnn.html portion of that path is also the
variable part of the URL used to access the message.

> My question is: if I take the <list-name>.mbox file is there any way how
> I can deduce working URL of individual emails?
> Say I can split the mbox file using:
> csplit -s -b %06d.mbox -z <list-name>.mbox '/^From /' {*}
> into individual emails. Would the numbering be the same as the one
> produced by mailman in this case? (Providing mailman numbering starts
> from zero)

It will be the same as the numbering produced by running bin/arch
--wipe. As you note below, this is not guaranteed to be the same as that
in the existing archive.

> I learned that if I use this csplit technique with public archives then
> the numbering is not guarantied to match (the order in which the mails
> are stored in public archives does not match the numbering order of
> mailman produced HTML files). Moreover public archive files do not
> contain all the email headers (charset, encoding, content-type, ...) and
> I don't want to index generated HTML files for now.

If you really need information from the cumulative .mbox which is not
available in the existing pipermail HTML files, I see two choices.

If you don't want to rebuild the pipermail archive and possibly renumber
messages, you will need to develop some script to go through the .mbox,
parse the archive period (year/month or whatever the period is in your
case) from each message, and search the nnnnnn.html files in that
period's directory for a match.
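
As a rough starting point for such a script, here is a minimal sketch, not
a finished tool. It assumes a monthly archive period and that the date on
each mbox "From " separator line falls in the same month pipermail filed
the message under, which may not hold for every message, so treat the
directory it prints as a first guess. The function names mbox_periods and
archive_path_to_url are hypothetical, and the matching of a message
against the nnnnnn.html files in that directory (by subject, sender, or
date, for example) is left out.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the monthly period are assumptions.

# Read an mbox on stdin and print "<index><TAB><yyyy-Month>" per message.
# Assumes separator lines of the form
#   From sender@example.com Sat Dec 18 05:45:00 2010
# The index counts recognized separators from zero, in mbox order.
mbox_periods() {
    awk '
        BEGIN {
            # Map month abbreviations to the full names used in
            # pipermail directory names such as 2010-December.
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", abbr, " ")
            split("January February March April May June July August September October November December", full, " ")
            for (i = 1; i <= 12; i++) name[abbr[i]] = full[i]
            n = 0
        }
        # Only lines that look like real separators are counted.
        /^From / && NF >= 7 && ($4 in name) && $NF ~ /^[0-9][0-9][0-9][0-9]$/ {
            printf "%06d\t%s-%s\n", n, $NF, name[$4]
            n++
        }
    '
}

# Turn the path of an archived message into its URL by replacing
# everything up to and including archives/private with a base URL.
#   $1: path such as /path/to/archives/private/LISTNAME/yyyy-Month/nnnnnn.html
#   $2: base URL such as http://www.example.com/pipermail
archive_path_to_url() {
    printf '%s%s\n' "$2" "${1##*archives/private}"
}
```

With these defined, "mbox_periods < LISTNAME.mbox" lists a candidate
directory for each message, and once you have found the matching file,
archive_path_to_url gives the URL for it.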

If you don't mind possibly renumbering messages, you could first check
the .mbox with bin/cleanarch and then rebuild the archive from the .mbox
with bin/arch --wipe, and then your csplit above will give the correct
new numbers.

Before rebuilding the archive, however, you might check whether the
numbering in the mbox really doesn't match. While it is not guaranteed to
match, it often does, particularly if the archive is not too old, i.e.,
if the oldest messages were archived by Mailman 2.1.x and not 2.0.x or
older.

-- 
Mark Sapiro <m...@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan