Re: UTF8 handling by Apache and Tcl

Massimo Manghi Sun, 29 Sep 2019 04:13:31 -0700



On 9/25/19 7:58 PM, Georgios Petasis wrote:

Dear Massimo,
My advice is to use "encoding system" in your code, and act accordingly in the code (use or not use encoding convertfrom). This way, the code will work even in cases where you cannot control the settings apache runs with.
Best,
George


Hi George

As I hinted in my first message in this thread, strings with accented characters were handled consistently until they went through ::rivet::escape_string, before making it into a URL.

The problem seems to be related to the byte string returned by this call in ::rivet::escape_string


origString = Tcl_GetStringFromObj( objv[1], &origLength );

With both utf-8 and iso8859-1 system encodings the returned string is invariably the utf-8 byte representation, which at first made sense to me because I know that Tcl handles strings as utf-8 internally. I'm not questioning what Tcl_GetStringFromObj does, but shouldn't it at this point be replaced by some function that returns a byte string consistent with the locale?
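To make the question concrete, here is a minimal standalone sketch of what "a byte string consistent with the locale" would mean for an iso8859-1 system: the utf-8 bytes are decoded and re-emitted as one byte per character. The function name utf8_to_latin1 is hypothetical and is not part of Rivet or Tcl. Inside Tcl's C API this kind of conversion is the job of Tcl_UtfToExternalDString; whether calling it is the right fix for ::rivet::escape_string is not established here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert well-formed UTF-8 to ISO-8859-1.
 * Code points above U+00FF have no Latin-1 byte and become '?'.
 * 'out' must have room for at least 'len' bytes.
 * Returns the number of bytes written.
 */
size_t utf8_to_latin1(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < len) {
        unsigned char b = in[i];

        if (b < 0x80) {
            /* ASCII: identical in both encodings */
            out[n++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < len
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence: U+0080 .. U+07FF */
            unsigned cp = ((unsigned)(b & 0x1F) << 6) | (in[i + 1] & 0x3F);

            out[n++] = (cp <= 0xFF) ? (unsigned char)cp : (unsigned char)'?';
            i += 2;
        } else {
            /* longer or malformed sequence: emit '?', skip its continuation bytes */
            out[n++] = '?';
            i += 1;
            while (i < len && (in[i] & 0xC0) == 0x80) {
                i++;
            }
        }
    }
    return n;
}
```

Feeding it the two bytes 0xc3 0xa9 yields the single byte 0xe9, which is what an iso8859-1 consumer expects for U+00E9.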

For example the accented character 'é', which has code 0xe9 as its byte representation in latin1 (and the same code point, U+00E9, in Unicode), is represented as 0xc3 0xa9 (utf-8 byte string) and it becomes %c3%a9. After this sequence of bytes has been unescaped it's returned by calling


Tcl_SetObjResult( interp, Tcl_NewStringObj( newString, -1 ) );

and the iso8859-1 machine represents it as Ã©
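The byte values in this example can be checked with a few lines of standalone C. This is only an illustrative sketch: the function names are made up for this message, the escaper escapes every byte (unlike a real URL escaper, which leaves unreserved characters alone), and none of it is taken from the Rivet sources.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a code point below U+0800 as UTF-8. Returns 1 or 2 (bytes written). */
size_t utf8_encode_below_0x800(unsigned cp, unsigned char out[2])
{
    assert(cp < 0x800);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}

/* Percent-escape every input byte. 'out' must hold 3 * len + 1 chars. */
void percent_escape_all(const unsigned char *in, size_t len, char *out)
{
    static const char hex[] = "0123456789abcdef";
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        *out++ = '%';
        *out++ = hex[in[i] >> 4];
        *out++ = hex[in[i] & 0x0F];
    }
    *out = '\0';
}

/*
 * Read each input byte as a Latin-1 character and write it as UTF-8.
 * This mimics what an iso8859-1 consumer does with bytes that were
 * really UTF-8. 'out' must hold 2 * len bytes. Returns bytes written.
 */
size_t latin1_bytes_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        n += utf8_encode_below_0x800(in[i], out + n);
    }
    return n;
}
```

Encoding U+00E9 gives 0xc3 0xa9, which escapes to %c3%a9. Reading those same two bytes as Latin-1 gives the two characters U+00C3 and U+00A9, that is, Ã followed by ©, which matches what the iso8859-1 machine shows.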

I'm trying to replace Tcl_GetString... with Tcl_GetByteArrayFromObj (and Tcl_NewByteArrayObj). The sequence of characters is correct but there is some extra stuff in it that breaks things.
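One thing that changes when moving from strings to byte arrays is that the data travels as a pointer plus an explicit length, so the escape and unescape routines should not rely on a terminating NUL. Whether that has anything to do with the extra bytes seen here is unknown; the following is only a sketch of a length-based unescape, with hypothetical names, not the Rivet code.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if 'c' is not a hex digit. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode %XX sequences in in[0 .. len-1] without looking past 'len'.
 * Incomplete or invalid escapes are copied through unchanged.
 * 'out' must hold at least 'len' bytes. Returns bytes written.
 */
size_t percent_unescape_n(const char *in, size_t len, unsigned char *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < len) {
        if (in[i] == '%' && i + 2 < len) {
            int hi = hex_value((unsigned char)in[i + 1]);
            int lo = hex_value((unsigned char)in[i + 2]);

            if (hi >= 0 && lo >= 0) {
                out[n++] = (unsigned char)(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out[n++] = (unsigned char)in[i++];
    }
    return n;
}
```

The returned count, rather than strlen, is then what would be handed to whatever builds the result object.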


Still working (and wasting time) on it


 -- Massimo





