>
> If it is including quoted passages, a simple way to address this is to
> remove any line starting with '>' and all attachments.
>

That is what I was planning to do.  I called it a problem only with
respect to incidence.

I am currently working on a Python implementation that strips headers and
quoted passages.  One problem I have discovered is that the gzip'd archives
often contain multiple copies of the same message (identical "Message-ID"
values in the header).  I am removing the duplicates, and the message count
after this operation matches the count shown online in the archives.
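
A minimal sketch of that approach, for illustration only. It is not the
actual implementation; the function names (split_mbox, plain_body,
strip_quoted, dedupe_by_message_id) are hypothetical, and it assumes the
archive is mbox-style text in which each message starts with a "From " line.

```python
import email
import gzip


def split_mbox(text):
    """Split mbox-style text into raw messages on lines starting with 'From '.

    Assumption: body lines that begin with 'From ' have been escaped by the
    archiver, as is conventional for mbox files.
    """
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages


def dedupe_by_message_id(raw_messages):
    """Parse raw messages and keep the first one seen for each Message-ID."""
    seen = set()
    unique = []
    for raw in raw_messages:
        msg = email.message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        if msg_id:
            if msg_id in seen:
                continue  # duplicate copy of a message already kept
            seen.add(msg_id)
        # Messages without a Message-ID cannot be compared, so keep them.
        unique.append(msg)
    return unique


def plain_body(msg):
    """Return the text/plain parts of a message, skipping attachments."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain" and part.get_filename() is None:
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


def strip_quoted(body):
    """Drop every line that starts with '>' (quoted passages)."""
    kept = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    return "\n".join(kept)


def load_archive(path):
    """Read one gzip'd archive file and return de-duplicated, cleaned bodies."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        raw_messages = split_mbox(fh.read())
    return [strip_quoted(plain_body(m)) for m in dedupe_by_message_id(raw_messages)]
```

Headers are dropped implicitly here because only the message body is
returned.  Keeping the first copy of each Message-ID is an arbitrary
choice; any copy would do if the duplicates are identical.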

-Peter
_______________________________________________
foundation-l mailing list
[email protected]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
