On Nov 15, 2005, at 4:11 AM, Gaetano Giunta wrote:

Brief analysis:

- the lib tries to encode all chars outside of the ASCII range as 'XML character entity' when serializing

I understand the theory, but one of the benefits to using UTF-8 in the first place is its ability to properly render all sorts of languages and character sets. Debugging becomes brutal when you're staring at a huge string of HTML entities.

- this has the main benefit that such an xml is valid regardless of the charset assumed by the parser, i.e. we do not need to add a 'charset' parameter to either the HTTP Content-type header or the XML prologue

Well...apparently it isn't valid XML despite the lack of charset...or we wouldn't be having this discussion! ;-)

- it is also the best solution I could come up with to solve the long-standing problems with charset encodings (I also tried the other way round, e.g. explicitly stating the charset used for xml, in a private fork of the lib I use for personal projects, but I would rather stick with the current approach, as it solves the problem in a more elegant way)

Believe me, I totally understand the issue of long-standing charset encoding problems! I've been developing a CMS that needs to handle multiple languages, alphabets, directionality, and XML-RPC/RSS feeds all on the same page! Not easy, especially if your own linguistic range is limited to English and Romance languages!

But I'm also a fan of proper declarations...and I'd rather have an XML feed explicitly declare its charset encoding (and work) than try to be "universal" and fail. :-)

I'll admit to not being fully familiar with all the XMLRPC library code -- only enough to debug a bit -- but it appears that $xmlrpc_internalencoding is declared as a global variable, though it is only used in object methods. Could it be changed to be a property of the xmlrpcmsg and xmlrpc_server classes? That way it could be set through scripting with

$xmlrpcmsg->set_internalencoding($foo);

or something similar? That would be more flexible, and since you _always_ know what the encoding is, you can send it in the XML prologue, which is what that parameter is designed for anyway.
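
A minimal sketch of that idea, assuming modern PHP (7.1+). The class name SketchMessage, the ISO-8859-1 default, and the xml_header() method are hypothetical stand-ins for illustration, not the library's actual code; only the set_internalencoding() name comes from the suggestion above.

```php
<?php
// Hypothetical stand-in for xmlrpcmsg; NOT the library's real class.
class SketchMessage
{
    // Assumed default; held per object instead of in a global variable.
    private $internalEncoding = 'ISO-8859-1';

    // Lets calling scripts choose the encoding without editing the library file.
    public function set_internalencoding(string $encoding): void
    {
        $this->internalEncoding = $encoding;
    }

    // Because the encoding is always known, it can be declared in the XML prologue.
    public function xml_header(): string
    {
        return '<?xml version="1.0" encoding="' . $this->internalEncoding . '"?>' . "\n";
    }
}

$msg = new SketchMessage();
$msg->set_internalencoding('UTF-8');
echo $msg->xml_header();
```

Keeping the value on the object means an upgrade of the library file would not overwrite it, since the setting lives in the calling script.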

- basically, I see two options to extend the lib to make up for your problem:
  + extend the xmlrpc_encode_entitites function to take into account the xmlrpc_internalencoding global var, and use 2 different parsing algorithms (better solution but slower)

Well...UTF-8 should only require converting "&", "<", and '"' explicitly, and the rest is assumed to be valid. So the only fork you'd need in the code is to convert additional entities for non-UTF-8 encodings. Shouldn't slow anything down...in fact, it would make UTF-8 faster, since it would skip additional processing.
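
The fork described above could look roughly like the following sketch. The function name sketch_encode_entities is hypothetical (it is not the library's xmlrpc_encode_entitites), the non-UTF-8 branch assumes ISO-8859-1 input, and escaping ">" and "'" as well is a conservative extra rather than something required here.

```php
<?php
// Hypothetical helper for illustration; NOT the library's actual function.
function sketch_encode_entities(string $data, string $internalEncoding = 'ISO-8859-1'): string
{
    // Markup characters are escaped whatever the charset is.
    $escaped = str_replace(
        ['&', '<', '>', '"', "'"],
        ['&amp;', '&lt;', '&gt;', '&quot;', '&apos;'],
        $data
    );

    if (strtoupper($internalEncoding) === 'UTF-8') {
        // UTF-8 path: multibyte sequences pass through untouched,
        // so no further processing is needed.
        return $escaped;
    }

    // Non-UTF-8 path (assumed ISO-8859-1): every byte above 0x7F has the
    // same value as its Unicode code point, so it can be emitted directly
    // as a numeric character reference.
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        static function (array $m): string {
            return '&#' . ord($m[0]) . ';';
        },
        $escaped
    );
}

// UTF-8 input keeps its bytes; only the markup characters change.
echo sketch_encode_entities("caf\xC3\xA9 & <b>", 'UTF-8'), "\n";
// ISO-8859-1 input has its high bytes turned into numeric references.
echo sketch_encode_entities("caf\xE9", 'ISO-8859-1'), "\n";
```

Output produced on the UTF-8 path is only valid if the document also declares that encoding (or the parser defaults to it), which is why this goes hand in hand with an explicit declaration in the XML prologue.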

In fact, I may be mistaken, but it seems like older versions of the library didn't even do the entity translation...at least, in the course of my own development, I know I included some entity conversion routines to process the data _before_ I sent it to the XMLRPC library (but it may have been redundant on my part). Though I admit I do like the idea that I can pass _anything_ to the XMLRPC library and have it properly encoded for me!

Would you be willing to test the patches?

Absolutely...but I do think you should give some serious thought to making the internal encoding variable more scriptable so no one ever needs to hard-code changes in the script file. I hate having to remember to change the variable value whenever I upgrade the library...

Cheers,
spud.


-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org            "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------

_______________________________________________
phpxmlrpc mailing list
phpxmlrpc@lists.usefulinc.com
http://lists.usefulinc.com/cgi-bin/mailman/listinfo/phpxmlrpc