Hi Mark,

First of all, thanks for your time. Your help is very valuable.
See below for my comments and questions:

On Sat, Dec 18, 2010 at 4:31 AM, Mark Sapiro <m...@msapiro.net> wrote:

> On 12/17/2010 4:55 AM, Lukáš Vlček wrote:
> >
> > I am looking at a best practice way how to integrate mailman with external
> > search engine. I found the following Wiki page [1] which contains a link to
> > Ext_Arch.py template which is brainchild of Mark Sapiro and Cedric Jeanneret
> > [2]. Cedric was after indexing emails using Xapian and his implementation of
> > the Ext_Arch.py can be found here [3]. This all looks very promising but I
> > have a few questions/concerns:
> >
> > To me it seems that the PUBLIC_EXTERNAL_ARCHIVER and
> > PRIVATE_EXTERNAL_ARCHIVER commands (which are both set in mm_cfg.py) are
> > executed only when a new message arrives, that means it is not executed when
> > bin/arch is executed. This means that if there has been running some mail
> > list on mailman for a few years now and now I would like to allow searching
> > its content via new external search engine (like Xapian) it is simply not
> > enough to add external archiver and restart mailman because this would index
> > only newly added messages. Am I right?
>
> Yes, you are right. The design intent of external archivers is that they
> provide a hook to use an external process for both archiving and
> searching of the external archive. External archivers were never
> intended to be used to index the built-in pipermail archive. Thus, the
> Ext_Arch.py template is just a kludge which is admittedly incomplete in
> this respect.
>
> > How can I then have reindexed old content from that mail list into Xapian as
> > well? bin/arch <maillist> does not do that as it does not execute external
> > archivers. Moreover, running bin/arch can change URLs of individual public
> > emails (re-numbering) and that is probably unacceptable. So is there any way
> > how to iterate over existing emails, parse them and get an existing URL
> > value for them? (Such information could be then used to re-index old content
> > into external search server without need to run bin/arch).
>
> find /path/to/archives/private/LISTNAME \
>   | egrep "[0-9]{6}.html" \
>   | sed "s;.*archives/private;http://www.example.com/pipermail;"
>
> with the obvious modification will get the URLs. Will that be enough?

Not exactly. I need to index the mailing list content with an external
search server, and for each indexed mail I need to know its working
Mailman public URL. Ext_Arch.py lets me hook into the archiving process:
it gives me a chance to index the content of newly added mails and also
gives me their public URLs. That is nice, but it does not tell me the
URLs of existing mails that are already in the mbox file.

My question is: if I take the <list-name>.mbox file, is there any way to
deduce the working URL of individual emails? Say I split the mbox file
into individual emails using:

    csplit -s -b %06d.mbox -z <list-name>.mbox '/^From /' {*}

Would the numbering be the same as the one produced by Mailman in this
case (provided Mailman numbering starts from zero)?

I learned that if I use this csplit technique on the public archives, the
numbering is not guaranteed to match (the order in which the mails are
stored in the public archives does not match the numbering order of the
HTML files produced by Mailman). Moreover, the public archive files do not
contain all the email headers (charset, encoding, content-type, ...), and
I don't want to index the generated HTML files for now.
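To make the question concrete, here is a minimal Python sketch of the same idea as the csplit command, using the standard library's mailbox module instead of splitting on '^From ' by hand. The function name list_mbox_messages is hypothetical, and the index it reports is only the message's position in the mbox file. Whether that position matches the number in the pipermail HTML file name is exactly the open question, so the Message-ID is returned alongside it for comparison against the archive.

```python
import mailbox


def list_mbox_messages(mbox_path):
    """Return (position, message_id, subject) for each message in file order.

    position is the zero-based position of the message in the mbox file.
    It is NOT known to equal the pipermail article number; that is an
    assumption that still has to be verified against the HTML archive.
    """
    # create=False: raise an error instead of silently creating an empty file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        rows = []
        for position, key in enumerate(box.iterkeys()):
            msg = box[key]
            rows.append((position, msg.get("Message-ID"), msg.get("Subject")))
        return rows
    finally:
        box.close()


# Example use (the path is a placeholder):
# for position, message_id, subject in list_mbox_messages("LISTNAME.mbox"):
#     print("%06d %s %s" % (position, message_id, subject))
```

Unlike the csplit output, each message here is parsed with all of its original headers, so charset and content-type are available to the indexer.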
Thanks a lot,
Lukas

> > [1] http://wiki.list.org/display/DOC/4.87+How+do+I+invoke+some+process+on+messages+as+they+are+added+to+the+pipermail+archive
> > [2] http://www.mail-archive.com/mailman-users@python.org/msg56679.html
> > [3] https://bugs.launchpad.net/mailman/+bug/531942/+attachment/1199211/+files/archive-and-index.py
>
> --
> Mark Sapiro <m...@msapiro.net>        The highway is for gamblers,
> San Francisco Bay Area, California    better use your sense - B. Dylan

------------------------------------------------------
Mailman-Users mailing list Mailman-Users@python.org
http://mail.python.org/mailman/listinfo/mailman-users