Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error

a.h.s. boy (lists) Tue, 15 Nov 2005 09:57:44 -0800

On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:

Very toughtful response.

Man, I love cross-linguistic typos... they make great new English words: "toughtful" = "tough thoughtfulness". Brilliant.

UTF-8 everywhere is fine and dandy but for 2 aspects:
- in fact XML-over-http without a charset declaration SHOULD be assumed to be ISO-8859-1 (there is an RFC somewhere about that, which I cannot recall now).

Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006) reads:

Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding MUST begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply.

RFC 2376, however, offers suggestions for XML MIME types sent over HTTP, but it reads (pardon the length):


      Although listed as an optional parameter, the use of the charset
      parameter is STRONGLY RECOMMENDED, since this information can be
      used by XML processors to determine authoritatively the character
      encoding of the XML entity. The charset parameter can also be used
      to provide protocol-specific operations, such as charset-based
      content negotiation in HTTP.  "UTF-8" [RFC-2279] is the
      recommended value, representing the UTF-8 charset. UTF-8 is
      supported by all conforming XML processors [REC-XML].

      If the XML entity is transmitted via HTTP, which uses a MIME-like
      mechanism that is exempt from the restrictions on the text
      top-level type (see section 19.4.1 of HTTP 1.1 [RFC-2068]),
      "UTF-16" (Appendix C.3 of [UNICODE] and Amendment 1 of
      [ISO-10646]) is also recommended. UTF-16 is supported by all
      conforming XML processors [REC-XML]. Since the handling of CR, LF
      and NUL for text types in most MIME applications would cause
      undesired transformations of individual octets in UTF-16
      multi-octet characters, gateways from HTTP to these MIME
      applications MUST transform the XML entity from a text/xml;
      charset="utf-16" to application/xml; charset="utf-16".

      Conformant with [RFC-2046], if a text/xml entity is received with
      the charset parameter omitted, MIME processors and XML processors
      MUST use the default charset value of "us-ascii".  In cases where
      the XML entity is transmitted via HTTP, the default charset value
      is still "us-ascii".

...which implies that us-ascii, not iso-8859-1, is the default (but not really a problem if you're encoding everything outside of ASCII). But I know that my RDFParser class, for example, defaults to "utf-8" and overrides that only if the encoding is specified as something else in the xml declaration. I assume I made that decision for good reasons, though I don't remember them now!

Still, the factors affecting encoding and transmission are unbelievably complex. In my software, for example, there are:


1) Page encoding used when users submit data via a form (mine: UTF-8)
   a) Default charset header sent by Apache (mine: UTF-8)
   b) Default charset set in META tags (mine: UTF-8)
   c) Charset setting of client browser (no control!)
2) Encoding of database (mine: MySQL 3.x, so limited to ISO-8859-1)
3) Encoding of page used to display data (irrelevant to XML-RPC transfers, but 1a, 1b, 1c apply)
4) PHP internal encoding
5) XMLRPC library internal encoding
6) XML declaration charset (optional, but highly recommended by spec)
7) text/xml MIME type charset declaration (optional, mine: text/xml; charset=utf-8)
8) application/xml MIME type charset declaration (optional)

...and since all of them could be set to different encodings, getting it all straight is a dizzying adventure. Add to that the complexity of handling things like users copying text from a Word document created in Windows-1252 and pasting into a form on a UTF-8 page, and...ugh! Sometimes I just want to kill myself.

While I suppose that attempting to convert all data into us-ascii through entity encoding gives us the "least common denominator" solution -- make everything 7-bit! -- it obviously isn't working perfectly. So perhaps any solution that simply makes it work, regardless of whether or not it changes the use of $xmlrpc_internalencoding, would be good. I did wonder about the utf8_encode() function, and why you didn't simply use that instead of $character = ("&#".strval($code).";"); -- won't that do all the right work for you?
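
To make the difference between the two approaches concrete, here is a minimal sketch, assuming the input is an ISO-8859-1 string. The function names to_numeric_entities() and latin1_to_utf8() are hypothetical and not part of the xmlrpc library; the second one does by hand roughly what utf8_encode() does for ISO-8859-1 input. The entity version stays 7-bit, while the UTF-8 version produces multi-byte output that is only correct if the message is also declared as UTF-8.

```php
<?php
// Hypothetical helpers for illustration; not part of the xmlrpc library.

// Replace every byte above 0x7F in an ISO-8859-1 string with a numeric
// character reference, so the result is pure 7-bit ASCII.
// (Escaping of &, < and > is a separate step, not shown here.)
function to_numeric_entities($latin1)
{
    $out = '';
    for ($i = 0; $i < strlen($latin1); $i++) {
        $code = ord($latin1[$i]);
        $out .= ($code > 127) ? '&#' . strval($code) . ';' : $latin1[$i];
    }
    return $out;
}

// Convert an ISO-8859-1 string to UTF-8 by hand. Code points 0x80-0xFF
// become two-byte sequences; ASCII bytes pass through unchanged.
function latin1_to_utf8($latin1)
{
    $out = '';
    for ($i = 0; $i < strlen($latin1); $i++) {
        $code = ord($latin1[$i]);
        if ($code < 0x80) {
            $out .= $latin1[$i];
        } else {
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

$word = "caf\xE9"; // "cafe" with e-acute, in ISO-8859-1
echo to_numeric_entities($word), "\n";      // caf&#233;
echo bin2hex(latin1_to_utf8($word)), "\n";  // 636166c3a9
```

So the two are not interchangeable: swapping one for the other changes what the receiving parser has to be told about the charset.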

In any case, I think you should try to make the XMLRPC library follow as closely as possible the relevant spec/RFC "recommended" behavior, and let that be your guide.

Adding some extra settings to client/server objects is fine, but the casual user might not be used to using those, and backward compatibility is a primary concern to me. Translated into code that would probably mean adding some hacky stuff of the sort "object default charset preference is undefined, and while still undefined use global variable, otherwise use object preference" (doable but ugly). The tough part is letting the client object communicate the desired charset encoding to the xmlrpcval object, since the responsibility of creating serialized content is left to the xmlrpcval object itself (and I'm surely not changing that fundamental assumption).

If you converted $xmlrpc_internalencoding to a property of xmlrpcmsg instead of a global variable, then you could simply set it to default to "iso-8859-1" in the constructor method for the class object. So you maintain your default, but allow users to reset it through scripting.
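
A minimal sketch of that idea, assuming a stand-in class: sketch_msg and its internal_encoding property are hypothetical names, not the real xmlrpcmsg API. It also folds in the "while still undefined use global variable" fallback described above, so existing scripts that set the global would keep working.

```php
<?php
// Hypothetical stand-in for xmlrpcmsg; names are illustrative only.
class sketch_msg
{
    public $internal_encoding;

    public function __construct($encoding = null)
    {
        if ($encoding !== null) {
            // An explicit per-object preference wins.
            $this->internal_encoding = $encoding;
        } elseif (isset($GLOBALS['xmlrpc_internalencoding'])) {
            // Backward compatibility: honor the global if a script set it.
            $this->internal_encoding = $GLOBALS['xmlrpc_internalencoding'];
        } else {
            // Same default as before.
            $this->internal_encoding = 'ISO-8859-1';
        }
    }
}

$m = new sketch_msg();
echo $m->internal_encoding, "\n";  // ISO-8859-1 unless the global is set
$m->internal_encoding = 'UTF-8';   // per-object override through scripting
```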

ps: the real (only?) advantage of using variables instead of constants for things such as internal_encoding is that you can redefine them not inside the xmlrpc lib but just after its inclusion, e.g.

    <?php
    include('xmlrpc.inc');
    $xmlrpc_internal_encoding = 'UTF-16';
    echo 'etc...';
    ?>

this way you do not have to change anything when updating...

Ah, yes, this is true, and I hadn't really thought of such a simple thing (but the same method holds true for using an object property).


How the PEAR people are handling this:
http://pear.php.net/bugs/bug.php?id=52

["According to RFC 3023 section 3.1, the encoding specified in the <?xml encoding=... ?> tag should be ignored for XML received over HTTP in favor of the encoding specified in the Content-Type header (e.g. "Content-Type: text/xml; charset=iso-8859-1")."]
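
The precedence described there, combined with the us-ascii default from the RFC text quoted earlier, could be sketched roughly as follows. pick_charset() is a hypothetical helper, not library code. Step 3 (falling back to the XML declaration and then to UTF-8 for non-text/xml types) is an assumption based on the usual reading of the XML spec, and byte-order-mark detection is left out entirely.

```php
<?php
// Hypothetical helper; a sketch of the precedence rules, not library code.
function pick_charset($contentType, $xmlBody)
{
    // 1. An explicit charset parameter in the Content-Type header wins.
    if (preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._-]+)"?/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // 2. text/xml with no charset defaults to us-ascii; the XML
    //    declaration is not consulted.
    if (preg_match('#^\s*text/xml\b#i', $contentType)) {
        return 'us-ascii';
    }
    // 3. Assumption: for other types, use the XML declaration if present,
    //    otherwise UTF-8.
    if (preg_match('/^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/', $xmlBody, $m)) {
        return strtolower($m[1]);
    }
    return 'utf-8';
}

$body = '<?xml version="1.0" encoding="utf-8"?><a/>';
echo pick_charset('text/xml; charset=iso-8859-1', $body), "\n"; // iso-8859-1
echo pick_charset('text/xml', $body), "\n";                     // us-ascii
echo pick_charset('application/xml', $body), "\n";              // utf-8
```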

I found another developer reflecting on these same questions, for a blogging app that uses XML-RPC:

http://ecto.kung-foo.tv/archives/000975.php

Other messages about the default encoding of unspecified xml documents:
http://groups.yahoo.com/group/xml-rpc/message/45

http://mail.zope.org/pipermail/zope-collector-monitor/2004-October/004361.html (in reverse chronological order)


Cheers,
spud.

-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org            "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------

_______________________________________________
phpxmlrpc mailing list
phpxmlrpc@lists.usefulinc.com
http://lists.usefulinc.com/cgi-bin/mailman/listinfo/phpxmlrpc
