Paul Querna wrote:
robert burrell donkin wrote:
On 7/2/06, Sam Ruby <[EMAIL PROTECTED]> wrote:
robert burrell donkin wrote:
the mailing list archives at apache run on mod_mbox which also supplies atom
feeds for these lists. i've added the feed from general to the front page
and think it'd be cool to add feeds to the pages in projects as well. since
the focus of podlings should be recruiting developers (not users) i'm
thinking of adding feeds to the dev lists.

opinions?

volunteers?
Just be aware that the feeds produced are rarely well formed XML, mostly
due to encoding issues.  For example: http://tinyurl.com/h5f7t

I tried to submit a patch based on my limited understanding of the code,
and was told that my patch wasn't acceptable

To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
Ruby file that only solved part of the problem. Again, AFAIK, no one
ever wrote a patch in C for mod_mbox to attempt to resolve this issue.

I offered.  The response was, and I quote, "Erm, no".

and that XML parsers that
require well-formedness were broken anyway -- despite that being
explicitly what the spec requires.

It's unfortunate that the discussion degraded into that.

I'd be willing to try again, but only if there was active interest in
actually fixing the problem.

Yes, there is active interest in making mod_mbox better.

This thread was in October, and since then the feed has not improved.

IMO we should fix the feed but i'm not involved with mod_mbox (or httpd).
does anyone who is want to jump in here?

The primary bug is lack of encoding support.  mod_mbox doesn't even
try to handle it.

Someone needs to write something that touches many parts of the code,
using the apr_xlate API to convert the content to utf-8.  (This would
also help it validate as HTML).  Once that is done, we still need to worry
about out-of-range characters, some of which would be removed, others
possibly HTML encoded.

Inside the message, there may be a content-type header. Inside this header, there may be a charset parameter. This charset parameter may be quoted, or it may not. It may be correct, or it may not.
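To make the quoted/unquoted cases concrete, here is a minimal sketch of pulling the charset parameter out of a Content-Type header value. The function name extract_charset is hypothetical and is not part of mod_mbox. The sketch assumes the header value has already been unfolded into a single string, and it does not try to cope with semicolons inside other quoted parameters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n bytes (ASCII only). */
static int ascii_ncasecmp(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ca = tolower((unsigned char)a[i]);
        int cb = tolower((unsigned char)b[i]);
        if (ca != cb || ca == '\0')
            return ca - cb;
    }
    return 0;
}

/*
 * Hypothetical helper: copy the charset parameter of a Content-Type
 * header value into out (truncating if out is too small).
 * Returns 1 if a non-empty charset was found, 0 otherwise.
 * The value is only a hint; it may still be wrong for the message.
 */
int extract_charset(const char *content_type, char *out, size_t outlen)
{
    assert(content_type != NULL && out != NULL && outlen > 0);

    const char *p = content_type;
    while ((p = strchr(p, ';')) != NULL) {
        p++;                                    /* step past ';' */
        while (*p == ' ' || *p == '\t')
            p++;
        if (ascii_ncasecmp(p, "charset", 7) != 0)
            continue;                           /* some other parameter */
        p += 7;
        while (*p == ' ' || *p == '\t')
            p++;
        if (*p != '=')
            continue;                           /* e.g. "charsetfoo=..." */
        p++;
        while (*p == ' ' || *p == '\t')
            p++;

        int quoted = (*p == '"');
        if (quoted)
            p++;

        size_t n = 0;
        while (*p != '\0' && n + 1 < outlen) {
            if (quoted ? (*p == '"')
                       : (*p == ';' || isspace((unsigned char)*p)))
                break;
            out[n++] = *p++;
        }
        out[n] = '\0';
        return n > 0;
    }
    out[0] = '\0';
    return 0;
}
```

A return value of 0 means the caller has nothing to go on and should fall back to whatever default it trusts least.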

It would be worthwhile to attempt to extract this, and to attempt to convert at least the body portion of the message to utf-8.
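As an illustration of the conversion step, the sketch below uses POSIX iconv rather than apr_xlate so that it stands alone outside httpd; inside mod_mbox the apr_xlate API mentioned earlier would be the natural choice. The function name to_utf8 is hypothetical, and the 4x output buffer is an assumed worst-case bound.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert len bytes of in from the named charset
 * to UTF-8. Returns a malloc'ed, NUL-terminated string, or NULL if the
 * charset is unknown or the bytes are not valid in that charset.
 */
char *to_utf8(const char *charset, const char *in, size_t len)
{
    assert(charset != NULL && in != NULL);

    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return NULL;                 /* declared charset not supported */

    /* Assumption: no input byte expands to more than 4 output bytes. */
    size_t cap = len * 4 + 1;
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;          /* iconv's prototype is not const */
    char *dst = out;
    size_t in_left = len;
    size_t out_left = cap - 1;

    size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &out_left);   /* flush state */
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(out);
        return NULL;                 /* charset label was likely wrong */
    }
    *dst = '\0';
    return out;
}
```

Because the declared charset may be wrong, a NULL result is expected in practice; the caller can then treat the original bytes as allegedly UTF-8 and rely on the sanitizing pass.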

But in any case, the results after the conversion need to be sanitized.
The Ruby code that I offered to convert to C does exactly that - takes
something that is allegedly utf-8 and corrects a number of common
errors, and produces something that is guaranteed to be well formed.  Of
course, if you feed in absolute garbage, what you will get back is well
formed line noise.
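For readers who want a feel for what such a sanitizing pass involves, here is a minimal sketch. It is not the code offered in this thread; the name clean_utf8 and the policy of replacing every bad byte or disallowed character with U+FFFD (rather than removing some of them) are assumptions made for illustration. It rejects malformed and overlong sequences and anything outside the XML 1.0 Char production.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Decode one UTF-8 sequence; return its length, or 0 if malformed. */
static size_t decode_utf8(const unsigned char *s, size_t left,
                          unsigned long *cp)
{
    unsigned char b0 = s[0];
    unsigned long c, min;
    size_t n;

    if (b0 < 0x80) { *cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { n = 2; c = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; c = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; c = b0 & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or bad lead */

    if (n > left)
        return 0;                       /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* missing continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min)
        return 0;                       /* overlong encoding */
    *cp = c;
    return n;
}

/* XML 1.0 Char production. */
static int xml_char_ok(unsigned long c)
{
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD) ||
           (c >= 0x10000 && c <= 0x10FFFF);
}

/*
 * Hypothetical helper: return a malloc'ed copy of in (len bytes) in
 * which every invalid byte and every character XML disallows has been
 * replaced by U+FFFD. The result is always valid UTF-8.
 */
char *clean_utf8(const char *in, size_t len)
{
    assert(in != NULL);

    const unsigned char *s = (const unsigned char *)in;
    char *out = malloc(len * 3 + 1);    /* U+FFFD is 3 bytes in UTF-8 */
    if (out == NULL)
        return NULL;

    size_t i = 0, o = 0;
    while (i < len) {
        unsigned long cp;
        size_t n = decode_utf8(s + i, len - i, &cp);
        if (n > 0 && xml_char_ok(cp)) {
            memcpy(out + o, s + i, n);  /* keep the sequence unchanged */
            o += n;
            i += n;
        } else {
            memcpy(out + o, "\xEF\xBF\xBD", 3);
            o += 3;
            i += (n > 0) ? n : 1;       /* resync one byte at a time */
        }
    }
    out[o] = '\0';
    return out;
}
```

This only guarantees legal characters; markup-significant characters such as & and < still need escaping before the text goes into the feed.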

As promised, here is a C version that does approximately the same thing:

http://intertwingly.net/stories/2006/07/04/clean_utf8_for_xml.c

This may be useful in display_atom_entry, mbox_static_message, and mbox_xml_message. It is safer than using <![CDATA[ ]]> because email messages (such as this one) may contain such strings.
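The CDATA hazard can be shown in a few lines. The sketch below escapes markup characters instead of wrapping the text in a CDATA section, so a message body that itself contains "]]>" cannot terminate anything early. The name xml_escape is hypothetical and the function is not taken from mod_mbox.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc'ed copy of in with &, < and >
 * replaced by their XML entity references.
 */
char *xml_escape(const char *in)
{
    assert(in != NULL);

    size_t len = strlen(in);
    char *out = malloc(len * 5 + 1);    /* "&amp;" is the longest expansion */
    if (out == NULL)
        return NULL;

    char *dst = out;
    for (const char *p = in; *p != '\0'; p++) {
        switch (*p) {
        case '&': memcpy(dst, "&amp;", 5); dst += 5; break;
        case '<': memcpy(dst, "&lt;", 4);  dst += 4; break;
        case '>': memcpy(dst, "&gt;", 4);  dst += 4; break;
        default:  *dst++ = *p;             break;
        }
    }
    *dst = '\0';
    return out;
}
```

Escaping > as well as < means the sequence "]]>" never appears in the output, whatever the message contains.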

Also note that if the content_type of the original MIME message contains the string "html", you might want to adjust the type attribute on the atom:content element accordingly.
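A minimal sketch of that check, assuming the MIME Content-Type value is available as a C string. The function name atom_content_type is hypothetical. In Atom, type="html" means the element content is escaped HTML markup, while type="text" means plain text.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive (ASCII) substring test. */
static int contains_nocase(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    size_t nlen = strlen(needle);
    for (; *haystack != '\0'; haystack++) {
        size_t i = 0;
        while (i < nlen && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i]))
            i++;
        if (i == nlen)
            return 1;
    }
    return nlen == 0;
}

/*
 * Hypothetical helper: choose the value of the type attribute on
 * atom:content from the MIME Content-Type of the original message.
 * A missing header is treated as plain text.
 */
const char *atom_content_type(const char *mime_content_type)
{
    if (mime_content_type != NULL &&
        contains_nocase(mime_content_type, "html"))
        return "html";
    return "text";
}
```

The substring test deliberately follows the rule as stated, so a value such as application/xhtml+xml is also reported as "html".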

But back to the original point: even if nobody puts in the effort to correctly interpret that message based on the specified charset, the addition of this code or something similar (1) is necessary anyway, (2) will make the result no worse than it currently is and has been for months, and (3) will be a marked improvement in that it will correct a number of common errors.

Please feel free to treat the code mentioned above as being under the Apache Software License version 2.0. If you don't like my indentation or bracing style, by all means, adjust it to your tastes. Convert the malloc to use the appropriate apr call. Or if you prefer, throw it all away, and start over. I don't care, I just want the Atom feeds that are produced to be clean and valid.

For future discussion of this please use [EMAIL PROTECTED]

OK

Thanks,

-Paul

- Sam Ruby
