I have a browser that sends a POST request with:

  content-type: application/x-www-form-urlencoded

and the hidden field "content" is populated (using client-side
JavaScript) with some XML which looks like this:

   <?xml version="1.0" encoding="UTF-8"?>
   <page>
    <title>Title</title>
    <abstract>Ã¨</abstract>
    ...
   </page>

the weird "Ã¨" text is the UTF-8 encoded value for [è], displayed one
byte per character (depending on your mail client you might not see any
of the above as I wrote it, but that's exactly part of the encoding
nightmare that UTF-8 was designed to fix... but there is still a long
way to go).
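To make the byte-level picture concrete, here is a minimal sketch in Python (not the language of the generator itself, just a convenient way to show the bytes). It assumes the "weird" rendering comes from reading the UTF-8 bytes with a single-byte charset such as ISO-8859-1:

```python
# Illustration only: how the single character "è" turns into two
# characters when its UTF-8 bytes are read with a single-byte charset.
original = "\u00e8"  # è, LATIN SMALL LETTER E WITH GRAVE

utf8_bytes = original.encode("utf-8")      # two bytes: 0xC3 0xA8
misread = utf8_bytes.decode("iso-8859-1")  # one character per byte

# The damage is reversible as long as no byte was dropped on the way.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(utf8_bytes.hex(), len(misread), repaired == original)
# prints: c3a8 2 True
```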

Now, I use StreamGenerator to get this text, have it parsed, and
feed my pipeline. So far so good.

The problem is that stupid StreamGenerator doesn't recognize the
encoding, because the content-type doesn't have the 'charset='
parameter defined (and IE can't be tweaked to emit that, AFAIK), so it
spits out the characters "as they are", as if they were ASCII encoded
(I used the LogTransformer to witness this: the same weird 'Ã¨' appears
in the logs, with no encoding translation taking place).
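One possible shape for a workaround, sketched in Python rather than in the generator's own code: read the charset from the content-type when it is there, and fall back to a configurable default when it is not. The helper names `charset_of` and `form_field`, and the UTF-8 default, are assumptions made for this illustration; they are not part of StreamGenerator.

```python
from email.message import Message
from urllib.parse import parse_qs, quote


def charset_of(content_type, default="utf-8"):
    """Return the charset parameter of a content-type, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter exists.
    return msg.get_content_charset() or default


def form_field(body, content_type, name, default_charset="utf-8"):
    """Decode one field of an application/x-www-form-urlencoded body."""
    charset = charset_of(content_type, default_charset)
    fields = parse_qs(body, encoding=charset)
    return fields[name][0]


# Hypothetical request body: the browser percent-encodes the UTF-8 bytes.
body = "content=" + quote("<abstract>\u00e8</abstract>")
bare_type = "application/x-www-form-urlencoded"

good = form_field(body, bare_type, "content")               # UTF-8 fallback
bad = form_field(body, bare_type, "content", "iso-8859-1")  # wrong fallback
```

With the wrong fallback the field comes back with two characters in place of the single è, which matches what the logs show.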

It seems that StreamGenerator (or the parser instance it instantiates)
fails to see that 'Ã¨' is not two 8-bit chars but one 16-bit char.

I'm positive the bug resides in StreamGenerator: in fact, if I tweak the
JavaScript to fill the form content with

   <?xml version="1.0" encoding="BLAH"?>

the parser doesn't even trigger an error.
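A silent "BLAH" would be consistent with the parser being handed already-decoded characters instead of raw bytes, in which case the encoding declaration never gets consulted. That is only a guess about StreamGenerator, not something checked in its source. The sketch below shows the same effect with Python's standard `xml.etree.ElementTree` as a stand-in parser; `try_parse` is a helper made up for this example.

```python
import xml.etree.ElementTree as ET

DOC = '<?xml version="1.0" encoding="BLAH"?><abstract>\u00e8</abstract>'


def try_parse(data):
    """Return the element text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(data).text
    except (ET.ParseError, LookupError, ValueError):
        # The exact exception type for an unknown encoding may vary.
        return None


# Characters in: the declared encoding is ignored, so no error is raised.
from_chars = try_parse(DOC)

# Bytes in: the parser has to honour the declaration and rejects "BLAH".
from_bytes = try_parse(DOC.encode("utf-8"))
```

If the same thing is happening here, the fix would be to give the parser the raw bytes (or a reader built with the right charset) rather than characters decoded with the wrong one.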

I'm going to investigate how to patch this, since I need it badly, but if
you have any suggestions I'm all ears.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------
