[Lift] Re: xml parser, utf-8, special characters... kill me now

Marc Boschma Sun, 15 Mar 2009 13:46:08 -0700

On 16/03/2009, at 6:59 AM, Charles F. Munat wrote:

>
> That was my thinking. It doesn't explain why &ccedil; in gets changed to
> &amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think there
> are two separate issues here.


I tend to agree.
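For what it's worth, the Ã§ half can be reproduced outside Lift. The sketch below assumes (and this is only a guess at the cause, not something confirmed in this thread) that some layer URL-decodes the form value as ISO-8859-1 instead of UTF-8. The value names are made up for illustration.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

// Render bytes as lowercase hex pairs, e.g. "c3 a7".
def hexOf(bytes: Array[Byte]): String =
  bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

// The browser sends c-cedilla as the UTF-8 bytes C3 A7, URL-encoded.
val onTheWire = "%C3%A7"

// Correct decoding: treat the two bytes as one UTF-8 character.
val decodedOk = URLDecoder.decode(onTheWire, "UTF-8")

// Assumed faulty decoding: each byte becomes its own Latin-1 character.
val decodedWrong = URLDecoder.decode(onTheWire, "ISO-8859-1")

// Re-encoding the misread string as UTF-8 turns two bytes into four.
val storedBytes = hexOf(decodedWrong.getBytes(UTF_8))

// The misread is reversible as long as no bytes were dropped along the way.
val repaired = new String(decodedWrong.getBytes(ISO_8859_1), UTF_8)

println(s"ok=$decodedOk wrong=$decodedWrong stored=$storedBytes repaired=$repaired")
```

The re-encoded bytes come out as c3 83 c2 a7, which is the same sequence Derek logged on form submission.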

>
>
> The ç can be created in two different ways in UTF-8. One is the single
> "c with a cedilla" character. The second is a c character followed by a
> cedilla character. I am not sure how UTF-8 indicates that these two
> characters should be displayed as one.

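The two-character sequence for c with a cedilla is the code points U+0063
U+0327, which is canonically equivalent to U+00E7 (and renders the same).
U+0327 is a combining character that modifies the preceding 'c' (U+0063).
This is a property of Unicode itself rather than of the UTF-8 encoding.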
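If the two forms ever do turn out to matter, java.text.Normalizer in the JDK can convert between them. A small sketch (the value and function names are mine, not from any code in this thread):

```scala
import java.text.Normalizer

// Render a string's UTF-8 bytes as lowercase hex pairs.
def utf8Hex(s: String): String =
  s.getBytes("UTF-8").map(b => f"${b & 0xff}%02x").mkString(" ")

val precomposed = "\u00E7"  // single code point: c with cedilla
val decomposed  = "c\u0327" // plain c followed by COMBINING CEDILLA

// The two strings look alike on screen but differ as code-point sequences.
val sameWithoutNormalizing = precomposed == decomposed

// NFC composes the pair into one code point; NFD splits it back apart.
val composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC)
val split    = Normalizer.normalize(precomposed, Normalizer.Form.NFD)

println(utf8Hex(precomposed)) // bytes of the single code point
println(utf8Hex(decomposed))  // bytes of the two-code-point form
```

Neither form encodes to c3 83 c2 a7, so I doubt normalization is what is going wrong here, but normalizing to NFC before storing would at least make comparisons predictable.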

> Neither am I sure that this has
> anything to do with the problem. Maybe it is simply that something is
> assuming Latin1 input even though the input is UTF-8.
>
> It is definitely on the front end, because it is stored in the database
> as Ã§.
>
> When I use &ccedil; instead, the problem is that it is *not* converted
> to ç as it goes into the database, and then on the way out the XML
> interpreter does not recognize it as a character entity reference and so
> converts the & to &amp;.

I think this is due to using the standard Scala XML load functions
rather than the Lift XML parser. From memory, I don't think the
standard parser recognises that many named entities. Does the numeric
reference &#x00E7; work instead of &ccedil;? If so, then that is
probably what is happening on this issue.
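If the numeric form does work, one possible workaround is to rewrite named entities into numeric character references before the text reaches the parser. The following is only a sketch: `namedToNumeric` and the entity table are hypothetical helpers of mine, not part of Lift or scala.xml, and the table covers just the handful of characters mentioned in this thread.

```scala
import scala.util.matching.Regex

// Illustrative subset of HTML named entities and their code points.
val namedEntities: Map[String, Int] = Map(
  "ccedil" -> 0x00E7, // c with cedilla
  "ntilde" -> 0x00F1, // n with tilde
  "eacute" -> 0x00E9, // e acute
  "egrave" -> 0x00E8, // e grave
  "trade"  -> 0x2122, // trademark sign
  "reg"    -> 0x00AE  // registered sign
)

// Matches references of the form &name; and captures the name.
val entityRef: Regex = "&([A-Za-z][A-Za-z0-9]*);".r

// Replace known named entities with hex numeric references.
// Anything not in the table (including &amp; and &lt;) is left untouched.
def namedToNumeric(s: String): String =
  entityRef.replaceAllIn(s, (m: Regex.Match) =>
    namedEntities.get(m.group(1)) match {
      case Some(cp) => Regex.quoteReplacement(f"&#x$cp%04X;")
      case None     => Regex.quoteReplacement(m.matched)
    })

println(namedToNumeric("cacha&ccedil;a &amp; more"))
```

This only sidesteps the entity problem. It does nothing for the Ã§ problem, which looks like a separate encoding issue.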

>
>
> Chas.
>
> Marc Boschma wrote:
>> Now I have some breakfast in me, to be clear: it appears that the UTF-8
>> byte stream is being interpreted as Latin1 and then converted to Unicode...
>>
>> Marc
>> On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
>>
>>> excuse the typo:
>>> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
>>>
>>>> Just looking at http://jeppesn.dk/utf-8.html , I found the following
>>>> lines:
>>>> Character  Latin1  Unicode  UTF-8   Latin1
>>>>                    code             interpr.
>>>> ç          E7      00 E7    C3 A7   Ã§
>>>> Ã is C38C, § is C2 A7
>>> Ã is C383
>>>> So it appears that somewhere there is a translation to Latin1 going on.
>>>> Hopefully that helps somewhat...
>>>> Regards,
>>>> Marc
>>>>
>>>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
>>>>
>>>>> This is really interesting. I've narrowed it down to something on
>>>>> form submission. The database shows gibberish, too, and if I
>>>>> manually enter the correct value in the DB it works fine on display.
>>>>> If I print the UTF-8 byte values of the string I get from the
>>>>> browser for my description when I submit a c with cedilla (ç), I see:
>>>>>
>>>>> INFO - Submitted desc bytes = c3 83 c2 a7
>>>>>
>>>>> A c with cedilla (ç) is c3 a7 in UTF-8, so I'm not sure where the
>>>>> "83 c2" is coming from. I googled around a bit and found other people
>>>>> having the same issue, but it wasn't clear in those posts what the
>>>>> cause was. I did a packet capture just as a sanity check, and here's
>>>>> what I got:
>>>>>
>>>>> POST / HTTP/1.1
>>>>> ... headers here ...
>>>>>
>>>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
>>>>>
>>>>> As you can see, the (url encoded) value of the F956759623048IZR
>>>>> field (description) is %C3%A7, so something isn't properly
>>>>> converting that. Helpers.urlDecode seems to be working properly:
>>>>>
>>>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
>>>>> res1: java.lang.String = F956759623048IZR=ç
>>>>>
>>>>> So I have no idea where this is coming from. All I know is that
>>>>> between the actual POST and when my submit function is called,
>>>>> something is tweaking the string. I'm going to dig some more, but I
>>>>> wanted to post this in case it triggers any thoughts out there.
>>>>>
>>>>> Derek
>>>>>
>>>>> PS - I just found this:
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
>>>>>
>>>>> May be related?
>>>>>
>>>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
>>>>> <[email protected]> wrote:
>>>>>
>>>>>    OK, I can replicate this in our PocketChange app (also going
>>>>>    against a PostgreSQL DB). Let me dig a bit.
>>>>>
>>>>>    Derek
>>>>>
>>>>>
>>>>>    On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
>>>>>    <[email protected]> wrote:
>>>>>
>>>>>
>>>>>        This might help, but I don't think I was clear. I have an online
>>>>>        form. My clients enter text into it. Their text has characters
>>>>>        like a c with a cedilla. That text gets saved into a PostgreSQL
>>>>>        database (UTF-8) varchar field via JPA/Hibernate.
>>>>>
>>>>>        Then I pull it back out and dump it into a template, and it
>>>>>        comes out gibberish. If I try using &ccedil; instead, I get
>>>>>        &amp;ccedil; back out.
>>>>>
>>>>>        Here is what I have:
>>>>>
>>>>>        "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
>>>>>
>>>>>        If I enter "cachaça" in the field, I get cachaÃ§a back out.
>>>>>        The weird thing is that sometimes when I copy and paste text
>>>>>        from another document into the form, it works. But if I use the
>>>>>        keyboard, it fails every time.
>>>>>
>>>>>        I'll play around with this. Thanks.
>>>>>
>>>>>        Chas.
>>>>>
>>>>>        Derek Chen-Becker wrote:
>>>>>> Oops, forgot scala.xml.Unparsed, too:
>>>>>>
>>>>>> scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
>>>>>> m: scala.xml.Elem = <span>a&ccedil;b</span>
>>>>>>
>>>>>> That one might be what you're looking for.
>>>>>>
>>>>>> Derek
>>>>>>
>>>>>> On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>    I think it depends on how you're embedding them in the XML:
>>>>>>
>>>>>>    scala> val m = <span>a&ccedil;b</span>
>>>>>>    m: scala.xml.Elem = <span>a&ccedil;b</span>
>>>>>>
>>>>>>    scala> val m = <span>a{"&ccedil;"}b</span>
>>>>>>    m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
>>>>>>
>>>>>>    scala> val m = <span>a{"ç"}b</span>
>>>>>>    m: scala.xml.Elem = <span>açb</span>
>>>>>>
>>>>>>    That last one was input using dead keys (alt+,) on my Linux (USA
>>>>>>    International with dead keys) layout. Let me know if this doesn't
>>>>>>    help; if not, could you send the code/template that's having issues?
>>>>>>
>>>>>>    Derek
>>>>>>
>>>>>>
>>>>>>    On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
>>>>>>    <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>        I have a site that uses a lot of "special" characters (a
>>>>>>        remarkably biased description, since there is nothing "special"
>>>>>>        about accented characters to the people who use them daily). In
>>>>>>        particular, I need the c with cedilla and the n with the tilde.
>>>>>>
>>>>>>        These characters are being input to a database (UTF-8) via an
>>>>>>        online form, then spit back out onto the page.
>>>>>>
>>>>>>        It's a fucking disaster. Apparently, everything goes through the
>>>>>>        xml parser, which is great, except when I try to enter these as
>>>>>>        entity references, such as &ccedil;, the parser changes & to
>>>>>>        &amp; and I get the literal &ccedil; back out again.
>>>>>>
>>>>>>        When I type ç using the keyboard (or copy and paste it from a
>>>>>>        page or a text editor), I get gibberish.
>>>>>>
>>>>>>        Anyone know the trick to getting around this? I need everything
>>>>>>        from e acute to e grave to trademark and registered trademark
>>>>>>        symbols, and I need to enter them this way.
>>>>>>
>>>>>>        Thanks for any help. If I can get this to work, I'll add an
>>>>>>        explanation to the wiki.
>>>>>>
>>>>>>        Chas.
