Darn, just when I thought I had reached charset-encoding guru state, I discover
I was mostly wrong.
I really love to be a coder...
> On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:
> > Very toughtful response.
> Man, I love cross-linguistic typos...makes great new English words:
> "toughtful" = "tough thoughtfulness". Brilliant.
I can do a lot better if you wish, mixing up Italian, French, English and PHP
typos all in the same sentence ;)
> > UTF-8 everywhere is fine and dandy but for 2 aspects:
> > - in fact XML-over-http without a charset declaration SHOULD be
> > assumed to be ISO-8859-1 (there is a RFC somewhere about that,
> > which I cannot recall now).
> Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)
> RFC 2376, however, offers suggestions for XML MIME-types sent over
> HTTP, but it reads (pardon the length):
OK, I'll admit I blew this one.
I cannot figure out which RFC I (mis)read that convinced me that latin-1 was
the way to go for text/xml over http, but RFC 3023 is definitely THE reference
on this subject. And it states that
- a charset-encoding SHOULD be put in the http headers for interop's sake
- when that is unavailable, xml MUST be treated as US-ASCII (regardless of the
> But I know that my RDFParser class, for example, defaults to "utf-8"
> and overrides that only if the encoding is specified as something
> else in the xml declaration. I assume I made that decision for good
> reasons, though I don't remember them now!
Most likely having bad sources of xml that send utf-8 stuff without declaring
it explicitly. Very annoying, but quite common, at least a little while ago.
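
A minimal sketch of the two behaviours discussed above for text/xml over http: the strict RFC 3023 default (us-ascii when the header carries no charset) versus the lenient utf-8 default. The function name guess_xml_charset() and its $lenient flag are hypothetical, made up for illustration; they are not part of the library.

```php
<?php
// Hypothetical helper, not part of the phpxmlrpc library.
// $contentType: raw value of the HTTP Content-Type header, or null if absent.
// $lenient: when true, fall back to UTF-8 instead of the RFC 3023 default.
function guess_xml_charset($contentType, $lenient = false)
{
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        // An explicit charset parameter in the header always wins.
        return strtoupper($m[1]);
    }
    // RFC 3023: text/xml without a charset parameter is treated as US-ASCII.
    // The lenient branch is the "bad sources of xml" workaround.
    return $lenient ? 'UTF-8' : 'US-ASCII';
}
```

The lenient branch trades spec compliance for interop with senders that emit undeclared utf-8, so it is probably best kept opt-in.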
> Still, the number of factors affecting encoding and transmission are
> unbelievably complex.
> and...ugh! Sometimes I just want to kill myself.
Yup, I only had the chance to prove myself with an arabic website once. It was
great fun, and a source of a lot of learning, but it never went online (and the
translator refused to translate single phrases as I had specced, to be put in
the translation engine db, but insisted on giving me back the 5 page translation
document without hinting at any separation of paragraphs...)
> While I suppose that attempting to convert all data into us-ascii
> through entity encoding gives us the "least common denominator"
> solution -- make everything 7-bit! -- it obviously isn't working
This is btw a 'road accident' not a by-design feature, and the previous
situation was wrong anyway.
The general solution (i.e. letting the lib encode any internal charset to ascii)
is a bit daunting to code in php, but adding the 80% case (i.e. utf-8 to ascii)
is, I think, quite easy. AND we are following the spec.
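
A rough sketch of what the utf-8 to ascii case could look like, assuming the input is valid utf-8. Both function names are hypothetical (this is not the library's xmlrpc_encode_entities()), and the code points are decoded by hand so that no extension beyond PCRE is needed.

```php
<?php
// Hypothetical helper: decode ONE multi-byte UTF-8 character to its code point.
function utf8_char_to_codepoint($char)
{
    $bytes = array_values(unpack('C*', $char));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0];
    }
    // Leading byte: drop the length marker bits (110xxxxx, 1110xxxx, 11110xxx).
    $code = $bytes[0] & (0xFF >> ($n + 1));
    // Continuation bytes each contribute their low 6 bits.
    for ($i = 1; $i < $n; $i++) {
        $code = ($code << 6) | ($bytes[$i] & 0x3F);
    }
    return $code;
}

// Hypothetical helper: turn a valid UTF-8 string into pure 7-bit XML text.
function utf8_to_ascii_entities($utf8)
{
    // Escape XML markup first, so the '&' of the numeric references
    // added below is not escaped a second time.
    $escaped = str_replace(
        array('&', '<', '>', '"'),
        array('&amp;', '&lt;', '&gt;', '&quot;'),
        $utf8
    );
    // Replace every non-ASCII character with a decimal character reference.
    // With the /u modifier, invalid UTF-8 input makes this return null.
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($m) {
            return '&#' . strval(utf8_char_to_codepoint($m[0])) . ';';
        },
        $escaped
    );
}
```

The output is plain us-ascii, so it stays valid whatever charset the http headers do or do not declare.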
> So perhaps any solution that simply makes it work,
> regardless of whether or not it changes the use of
> $xmlrpc_internalencoding, would be good. I did wonder about the
> utf8_encode() function, and why you didn't simply use that
> instead of
> $character = ("&#".strval($code).";"); Won't that do all the right
> work for you?
Yes, provided that we added UTF-8 in the http headers.
No, in the current situation.
> In any case, I think you should try to make the XMLRPC
> library follow
> as closely as possible the relevant spec/RFC "recommended" behavior,
> and let that be your guide.
What I am currently thinking about is something along these lines:
1 - add support for xmlrpc_internalencoding in xmlrpc_encode_entities(), ONLY
for utf-8 to ascii, ascii-to-ascii and iso-8859-1 to ascii
2 - add support for specific charset encodings into xmlrpcmsg. If left
unspecified, defaults to us-ascii, as per the current behaviour. When
specified, it will modify the http content-type header, and potentially save a
lot of time while NOT encoding special chars into xml entities
3 - figure out whether the response charset encoding should be decided by
the response object or by the server. Hint: the server can make intelligent
decisions based on the client's http headers (Accept-Charset).
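
For point 3, a sketch of how a server might pick the response charset from the client's Accept-Charset header. The function name pick_response_charset() and the list of supported charsets are assumptions made for illustration, nothing here exists in the library, and the q-value handling is deliberately simplified.

```php
<?php
// Hypothetical helper, not part of the phpxmlrpc library.
// $acceptCharset: raw Accept-Charset header value from the client, or null.
// $supported: charsets the server is able to emit (assumed list).
function pick_response_charset($acceptCharset, $supported = array('UTF-8', 'ISO-8859-1', 'US-ASCII'))
{
    if ($acceptCharset === null || trim($acceptCharset) === '') {
        // No hint from the client: keep the us-ascii default.
        return 'US-ASCII';
    }
    // Parse "name;q=value" items into a name => quality map.
    $accepted = array();
    foreach (explode(',', $acceptCharset) as $item) {
        $parts = explode(';', $item);
        $name = strtoupper(trim($parts[0]));
        $q = 1.0;
        if (isset($parts[1]) && preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $parts[1], $m)) {
            $q = (float) $m[1];
        }
        if ($name !== '') {
            $accepted[$name] = $q;
        }
    }
    // Keep the supported charset with the highest client quality value;
    // on ties, the one listed first in $supported is kept.
    $best = 'US-ASCII';
    $bestQ = 0.0;
    foreach ($supported as $charset) {
        $key = strtoupper($charset);
        if (isset($accepted[$key])) {
            $q = $accepted[$key];
        } elseif (isset($accepted['*'])) {
            $q = $accepted['*'];
        } else {
            $q = 0.0;
        }
        if ($q > $bestQ) {
            $best = $charset;
            $bestQ = $q;
        }
    }
    return $best;
}
```

When nothing the client lists is supported, the sketch falls back to us-ascii, which combined with entity encoding should be readable by any client.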
phpxmlrpc mailing list