> > If it is including quoted passages, a simple way to address this is to
> remove any line starting with '>' and all attachments.
>
That is what I was planning to do. I called it a problem only with regard to incidence.

I am currently working on a Python implementation that strips headers and quoted passages. One problem I have discovered is that the gzip'd archives often contain multiple copies of the same message (matching Message-ID values in the header). I am removing the duplicates, and the count after this operation matched the count shown in the online archives.

-Peter

_______________________________________________
foundation-l mailing list
[email protected]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
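
For illustration, here is a minimal sketch of the steps described above: keep one copy per Message-ID, take only the plain-text body (which skips headers and attachments), and drop lines starting with '>'. It is not the actual implementation mentioned in this message. The function names and the sample messages are hypothetical, and it assumes the archive has already been decompressed and split into individual raw messages.

```python
import email
from email import policy


def strip_quoted(body):
    """Drop every line that starts with '>' (quoted passages)."""
    kept = [line for line in body.splitlines() if not line.startswith(">")]
    return "\n".join(kept).strip()


def plain_body(msg):
    """Return only the text/plain content, so headers and attachments are skipped."""
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def unique_messages(raw_messages):
    """Parse raw messages and keep the first one seen for each Message-ID."""
    seen = set()
    unique = []
    for raw in raw_messages:
        msg = email.message_from_string(raw, policy=policy.default)
        msg_id = msg["Message-ID"]
        if msg_id is not None:
            msg_id = msg_id.strip()
            if msg_id in seen:
                continue  # duplicate copy of a message already kept
            seen.add(msg_id)
        unique.append(msg)
    return unique


# Hypothetical sample data: the first two entries are copies of one message.
raw = [
    "Message-ID: <1@example.org>\nSubject: a\n\n> quoted text\nreply text\n",
    "Message-ID: <1@example.org>\nSubject: a\n\n> quoted text\nreply text\n",
    "Message-ID: <2@example.org>\nSubject: b\n\nsecond message\n",
]
messages = unique_messages(raw)
bodies = [strip_quoted(plain_body(m)) for m in messages]
```

Messages without a Message-ID header are kept as-is in this sketch, since there is nothing to compare them on; whether that is the right choice depends on the archive.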
