[Lift] Re: xml parser, utf-8, special characters... kill me now

Charles F. Munat Sun, 15 Mar 2009 13:38:46 -0700

Unfortunately, there is no easy way to do that with user input. But the 
use of character entity references is problematic in itself. I can't 
teach all my site's users all the references they will need, nor is it 
really reasonable to expect, for example, an international group of 
users to have to hand code every accented character.


There must be a way to input UTF-8 and have it come out properly. I've 
set the keyboard on my Mac to U.S. Extended, which makes everything 
UTF-8. I note that *most* of the keyboards available for the Mac are 
UTF-8 (though the default U.S. keyboard is Roman, and there are many 
European keyboards that are Roman or Cyrillic).

Ideally, Lift would recognize the character encoding and act 
appropriately. (I'd be happy to convert everything to UTF-8.) Another 
possibility, much less preferred but at least workable, would be to add 
the ability for the user to select the character encoding (they could 
use trial and error if they weren't sure).
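
A minimal sketch of the "recognize the encoding and convert everything to UTF-8" idea, assuming the raw request bytes are available. It tries a strict UTF-8 decode first and falls back to Latin-1 only if the bytes are not valid UTF-8. The names `DecodeInput` and `decode` are hypothetical and not part of Lift; the charset classes are from the standard JDK.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical helper, not a Lift API.
object DecodeInput {
  // Decode raw bytes as strict UTF-8. If they are not valid UTF-8,
  // fall back to ISO-8859-1 (Latin-1), which accepts any byte sequence.
  def decode(bytes: Array[Byte]): String = {
    val strictUtf8 = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try strictUtf8.decode(ByteBuffer.wrap(bytes)).toString
    catch {
      case _: CharacterCodingException =>
        new String(bytes, StandardCharsets.ISO_8859_1)
    }
  }
}
```

This is only a heuristic: a byte sequence that happens to be valid UTF-8 is always treated as UTF-8, even if the sender meant Latin-1.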

But the upshot is that someone with a keyboard set to UTF-8 (which 
includes much of the world) should be able to use that keyboard and have 
it come out the same way it went in. I have no idea how to accomplish 
this, however, as I don't know how that part of Lift works.

Chas.

Derek Chen-Becker wrote:
> The scala XML syntax automatically converts any "&" in embedded strings 
> to "&amp;". You have to put the string inside a scala.xml.Unparsed node 
> to prevent that from happening.
> 
> Derek
> 
> On Sun, Mar 15, 2009 at 1:59 PM, Charles F. Munat <[email protected]> wrote:
> 
> 
>     That was my thinking. It doesn't explain why &ccedil; in gets changed to
>     &amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think there
>     are two separate issues here.
> 
>     The ç can be created in two different ways in UTF-8. One is the single
>     "c with a cedilla" character. The second is a c character followed by a
>     cedilla character. I am not sure how UTF-8 indicates that these two
>     characters should be displayed as one. Neither am I sure that this has
>     anything to do with the problem. Maybe it is simply that something is
>     assuming Latin1 input even though the input is UTF-8.
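
A short sketch of the two spellings of ç described above, using `java.text.Normalizer` from the JDK. The object and method names (`CedillaForms`, `nfc`, `utf8Hex`) are made up for illustration. Neither spelling encodes to the bytes seen in this thread, which suggests the combining-character question is a separate matter from the gibberish.

```scala
import java.text.Normalizer

// Hypothetical names, for illustration only.
object CedillaForms {
  val precomposed = "\u00e7"  // one code point: LATIN SMALL LETTER C WITH CEDILLA
  val decomposed  = "c\u0327" // 'c' followed by COMBINING CEDILLA

  // Normalize to NFC so both spellings compare equal.
  def nfc(s: String): String = Normalizer.normalize(s, Normalizer.Form.NFC)

  // UTF-8 bytes of a string as lowercase hex, separated by spaces.
  def utf8Hex(s: String): String =
    s.getBytes("UTF-8").map(b => f"${b & 0xff}%02x").mkString(" ")
}
```

Here `utf8Hex(precomposed)` is `c3 a7` and `utf8Hex(decomposed)` is `63 cc a7`. The two strings are not equal until both are passed through `nfc`.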
> 
>     It is definitely on the front end, because it is stored in the database
>     as Ã§.
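
For rows that are already stored as Ã§, one possible repair is to reverse the mistake: re-encode the garbled text as Latin-1 to get the original bytes back, then decode those bytes as UTF-8. This is a sketch under the assumption that the whole value went through exactly one round of "UTF-8 read as Latin-1"; `RepairMojibake` and `repair` are hypothetical names.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

// Hypothetical helper, for illustration only.
object RepairMojibake {
  // Undo one round of "UTF-8 bytes were decoded as Latin-1".
  def repair(garbled: String): String =
    new String(garbled.getBytes(ISO_8859_1), UTF_8)
}
```

Applying `repair` to text that was stored correctly would damage it, so it should only be run on values known to be garbled.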
> 
>     When I use &ccedil; instead, the problem is that it is *not* converted
>     to ç as it goes into the database, and then on the way out the XML
>     interpreter does not recognize it as a character entity reference and so
>     converts the & to &amp;.
> 
>     Chas.
> 
>     Marc Boschma wrote:
>      > Now that I have some breakfast in me: to be clear, it appears that the
>      > UTF-8 byte stream is being interpreted as Latin1 and then converted to
>      > Unicode...
>      >
>      > Marc
>      > On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
>      >
>      >> excuse the typo:
>      >> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
>      >>
>      >>> Just looking at http://jeppesn.dk/utf-8.html , I found the following
>      >>> lines:
>      >>> Character  Latin1 code  Unicode  UTF-8  Latin1 interpr.
>      >>> ç          E7           00 E7    C3 A7  Ã§
>      >>> Ã is C38C, § is C2 A7
>      >> Ã is C383
>      >>> So it appears that somewhere there is a translation to Latin 1 going on.
>      >>> Hopefully that helps somewhat...
>      >>> Regards,
>      >>> Marc
>      >>>
>      >>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
>      >>>
>      >>>> This is really interesting. I've narrowed it down to something on
>      >>>> form submission. The database shows gibberish, too, and if I
>      >>>> manually enter the correct value in the DB it works fine on display.
>      >>>> If I print the UTF-8 byte values of the string I get from the
>      >>>> browser for my description when I submit a cedilla (ç), I see:
>      >>>>
>      >>>> INFO - Submitted desc bytes = c3 83 c2 a7
>      >>>>
>      >>>> A c with cedilla (ç) is c3 a7 in UTF-8, so I'm not sure where the
>      >>>> "83 c2" is coming from. I googled around a bit and found other people
>      >>>> having the same issue, but it wasn't clear in those posts what the
>      >>>> cause was. I did a packet capture just as a sanity check, and here's
>      >>>> what I got:
>      >>>>
>      >>>> POST / HTTP/1.1
>      >>>> ... headers here ...
>      >>>>
>      >>>>
>      >>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
>      >>>>
>      >>>> As you can see, the (url encoded) value of the F956759623048IZR
>      >>>> field (description) is %C3%A7, so something isn't properly
>      >>>> converting that. Helpers.urlDecode seems to be working properly:
>      >>>>
>      >>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
>      >>>> res1: java.lang.String = F956759623048IZR=ç
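
One way to reproduce the `c3 83 c2 a7` bytes from the log above is to percent-decode the same field value as ISO-8859-1 instead of UTF-8. This sketch uses only `java.net.URLDecoder` from the JDK; `DecodeField` and `hex` are made-up names, and it does not show where in Lift or the servlet container the wrong charset is actually applied.

```scala
import java.net.URLDecoder

// Hypothetical names, for illustration only.
object DecodeField {
  // UTF-8 bytes of a string as lowercase hex, separated by spaces.
  def hex(s: String): String =
    s.getBytes("UTF-8").map(b => f"${b & 0xff}%02x").mkString(" ")

  // Same percent-encoded input, two different charsets.
  val right = URLDecoder.decode("%C3%A7", "UTF-8")      // the single character ç
  val wrong = URLDecoder.decode("%C3%A7", "ISO-8859-1") // two characters: Ã and §
}
```

`hex(right)` is `c3 a7`, while `hex(wrong)` is `c3 83 c2 a7`, which matches the submitted bytes. That is consistent with the request parameters being decoded as Latin-1 somewhere between the POST and the submit function.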
>      >>>>
>      >>>> So I have no idea where this is coming from. All I know is that
>      >>>> between the actual POST and when my submit function is called,
>      >>>> something is tweaking the string. I'm going to dig some more, but I
>      >>>> wanted to post this in case it triggers any thoughts out there.
>      >>>>
>      >>>> Derek
>      >>>>
>      >>>> PS - I just found this:
>      >>>>
>      >>>>
>      >>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
>      >>>>
>      >>>> May be related?
>      >>>>
>      >>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
>      >>>> <[email protected]> wrote:
>      >>>>
>      >>>>     OK, I can replicate this in our PocketChange app (also going
>      >>>>     against a PostgreSQL DB). Let me dig a bit.
>      >>>>
>      >>>>     Derek
>      >>>>
>      >>>>
>      >>>>     On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
>      >>>>     <[email protected]> wrote:
>      >>>>
>      >>>>
>      >>>>         This might help, but I don't think I was clear. I have an
>      >>>>         online form. My clients enter text into it. Their text has
>      >>>>         characters like a c with a cedilla. That text gets saved into
>      >>>>         a PostgreSQL database (UTF-8) varchar field via JPA/Hibernate.
>      >>>>
>      >>>>         Then I pull it back out and dump it into a template, and it
>      >>>>         comes out gibberish. If I try using &ccedil; instead, I get
>      >>>>         &amp;ccedil; back out.
>      >>>>
>      >>>>         Here is what I have:
>      >>>>
>      >>>>         "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
>      >>>>
>      >>>>         If I enter "cachaça" in the field, I get cachaÃ§a back out.
>      >>>>         The weird thing is that sometimes when I copy and paste text
>      >>>>         from another document into the form, it works. But if I use
>      >>>>         the keyboard, it fails every time.
>      >>>>
>      >>>>         I'll play around with this. Thanks.
>      >>>>
>      >>>>         Chas.
>      >>>>
>      >>>>         Derek Chen-Becker wrote:
>      >>>>         > Oops, forgot scala.xml.Unparsed, too:
>      >>>>         >
>      >>>>         > scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
>      >>>>         > m: scala.xml.Elem = <span>a&ccedil;b</span>
>      >>>>         >
>      >>>>         > That one might be what you're looking for.
>      >>>>         >
>      >>>>         > Derek
>      >>>>         >
>      >>>>         > On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
>      >>>>         > <[email protected]> wrote:
>      >>>>         >
>      >>>>         >     I think it depends on how you're embedding them in the XML:
>      >>>>         >
>      >>>>         >     scala> val m = <span>a&ccedil;b</span>
>      >>>>         >     m: scala.xml.Elem = <span>a&ccedil;b</span>
>      >>>>         >
>      >>>>         >     scala> val m = <span>a{"&ccedil;"}b</span>
>      >>>>         >     m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
>      >>>>         >
>      >>>>         >     scala> val m = <span>a{"ç"}b</span>
>      >>>>         >     m: scala.xml.Elem = <span>açb</span>
>      >>>>         >
>      >>>>         >     That last one was input using dead keys (alt+,) on my
>      >>>>         >     Linux (USA International with dead keys) layout. Let me
>      >>>>         >     know if this doesn't help; if it doesn't, could you send
>      >>>>         >     the code/template that's having issues?
>      >>>>         >
>      >>>>         >     Derek
>      >>>>         >
>      >>>>         >
>      >>>>         >     On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
>      >>>>         >     <[email protected]> wrote:
>      >>>>         >
>      >>>>         >
>      >>>>         >         I have a site that uses a lot of "special" characters
>      >>>>         >         (a remarkably biased description, since there is
>      >>>>         >         nothing "special" about accented characters to the
>      >>>>         >         people who use them daily). In particular, I need the
>      >>>>         >         c with cedilla and the n with the tilde.
>      >>>>         >
>      >>>>         >         These characters are being input to a database (UTF-8)
>      >>>>         >         via an online form, then spit back out onto the page.
>      >>>>         >
>      >>>>         >         It's a fucking disaster. Apparently, everything goes
>      >>>>         >         through the xml parser, which is great, except when I
>      >>>>         >         try to enter these as entity references, such as
>      >>>>         >         &ccedil;, the parser changes & to &amp; and I get the
>      >>>>         >         literal &ccedil; back out again.
>      >>>>         >
>      >>>>         >         When I type ç using the keyboard (or copy and paste it
>      >>>>         >         from a page or a text editor), I get gibberish.
>      >>>>         >
>      >>>>         >         Anyone know the trick to getting around this? I need
>      >>>>         >         everything from e acute to e grave to trademark and
>      >>>>         >         registered trademark symbols, and I need to enter them
>      >>>>         >         this way.
>      >>>>         >
>      >>>>         >         Thanks for any help. If I can get this to work, I'll
>      >>>>         >         add an explanation to the wiki.
>      >>>>         >
>      >>>>         >         Chas.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Lift" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/liftweb?hl=en
-~----------~----~----~----~------~----~------~--~---
