[Lift] Re: xml parser, utf-8, special characters... kill me now

Charles F. Munat Sun, 15 Mar 2009 14:00:24 -0700

I just went back and tried changing it in the database itself, and that 
worked fine. So now I have a workaround, but it's one that creates a 
huge amount of work for me... :-(


Chas.

Charles F. Munat wrote:
> Oh, sorry, Derek. My bad. I didn't mean to imply that you were saying 
> that the situation was optimal. I understood where you were coming from. 
>   Actually, I wasn't really addressing your comment after my first 
> sentence. I should have made that clear. Haven't had my coffee yet...
> 
> This is kind of important to me. I have a site that is sponsored by some 
> big liquor companies. Many of them are European, and then the Brazilian 
> ones are all selling cachaça. Eliminating accents and changing ç to c 
> does not make them happy, which does not make my client happy. And I 
> can't explain to them why I can't help it because their sites all work 
> fine with ç. So I spent more than 40 hours this week, mostly between 
> midnight and 6 AM, inputting data that my client could have input 
> themselves because I didn't want them to have to deal with this problem. 
> That was above and beyond the 40+ hours I spent programming.
> 
> Now I have to go back and change all those after we figure this out. So 
> it's a pretty major issue for me at the moment.
> 
> I'm thinking that as a workaround, I can go change things directly in 
> the database and see if that helps. Ugh. That's gonna mean another week 
> of no sleep.
> 
> Can you point me to the spot in Lift code where this all happens? I'd 
> love to be part of the solution instead of just the guy who points 
> things out.
> 
> Chas.
> 
> Derek Chen-Becker wrote:
>> Sorry, I'm not suggesting that this is the appropriate method for users; 
>> they should just be able to type. I was just trying to explain why the 
>> "&" is getting expanded. I think that the current behavior is not really 
>> what anyone wants, and hopefully we can fix it in a transparent manner.
>>
>> Derek
>>
>> On Sun, Mar 15, 2009 at 2:38 PM, Charles F. Munat <[email protected]> wrote:
>>
>>
>>     Unfortunately, there is no easy way to do that with user input. But the
>>     use of character entity references is problematic in itself. I can't
>>     teach all my site's users all the references they will need, nor is it
>>     really reasonable to expect, for example, an international group of
>>     users to have to hand code every accented character.
>>
>>     There must be a way to input UTF-8 and have it come out properly. I've
>>     set the keyboard on my Mac to U.S. Extended, which makes everything
>>     UTF-8. I note that *most* of the keyboards available for the Mac are
>>     UTF-8 (though the default U.S. keyboard is Roman, and there are many
>>     European keyboards that are Roman or Cyrillic).
>>
>>     Ideally, Lift would recognize the character encoding and act
>>     appropriately. (I'd be happy to convert everything to UTF-8.) Another
>>     possibility, much less preferred but at least workable, would be to add
>>     the ability for the user to select the character encoding (they could
>>     use trial and error if they weren't sure).
>>
>>     But the upshot is that someone with a keyboard set to UTF-8 (which
>>     includes much of the world) should be able to use that keyboard and have
>>     it come out the same way it went in. I have no idea how to accomplish
>>     this, however, as I don't know how that part of Lift works.
>>
>>     Chas.
>>
>>     Derek Chen-Becker wrote:
>>      > The Scala XML syntax automatically converts any "&" in embedded strings
>>      > to "&amp;". You have to put the string inside a scala.xml.Unparsed node
>>      > to prevent that from happening.
>>      >
>>      > Derek
>>      >
>>      > On Sun, Mar 15, 2009 at 1:59 PM, Charles F. Munat <[email protected]> wrote:
>>      >
>>      >
>>      >     That was my thinking. It doesn't explain why &ccedil; in gets changed to
>>      >     &amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think there
>>      >     are two separate issues here.
>>      >
>>      >     The ç can be created in two different ways in UTF-8. One is the single
>>      >     "c with a cedilla" character. The second is a c character followed by a
>>      >     cedilla character. I am not sure how UTF-8 indicates that these two
>>      >     characters should be displayed as one. Neither am I sure that this has
>>      >     anything to do with the problem. Maybe it is simply that something is
>>      >     assuming Latin1 input even though the input is UTF-8.
>>      >
>>      >     It is definitely on the front end, because it is stored in the database
>>      >     as Ã§.
>>      >
>>      >     When I use &ccedil; instead, the problem is that it is *not* converted
>>      >     to ç as it goes into the database, and then on the way out the XML
>>      >     interpreter does not recognize it as a character entity reference and so
>>      >     converts the & to &amp;.
>>      >
>>      >     Chas.
>>      >
>>      >     Marc Boschma wrote:
>>      >      > Now that I have some breakfast in me: to be clear, it appears that the
>>      >      > UTF-8 byte stream is being interpreted as Latin1 and then converted to
>>      >      > Unicode...
>>      >      >
>>      >      > Marc
>>      >      > On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
>>      >      >
>>      >      >> excuse the typo:
>>      >      >> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
>>      >      >>
>>      >      >>> Just looking at http://jeppesn.dk/utf-8.html , I found the following
>>      >      >>> lines:
>>      >      >>> Character   Latin1 code   Unicode   UTF-8   Latin1 interpr.
>>      >      >>> ç           E7            00 E7     C3 A7   Ã§
>>      >      >>> Ã is C38C, § is C2 A7
>>      >      >> Ã is C383
>>      >      >>> So it appears that somewhere there is a translation to Latin 1 going on.
>>      >      >>> Hopefully that helps somewhat...
>>      >      >>> Regards,
>>      >      >>> Marc
>>      >      >>>
>>      >      >>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
>>      >      >>>
>>      >      >>>> This is really interesting. I've narrowed it down to something on
>>      >      >>>> form submission. The database shows gibberish, too, and if I
>>      >      >>>> manually enter the correct value in the DB it works fine on display.
>>      >      >>>> If I print the UTF-8 byte values of the string I get from the
>>      >      >>>> browser for my description when I submit a cedilla (ç), I see:
>>      >      >>>>
>>      >      >>>> INFO - Submitted desc bytes = c3 83 c2 a7
>>      >      >>>>
>>      >      >>>> A cedilla is c3 a7 in UTF-8, so I'm not sure where the "83 c2" is
>>      >      >>>> coming from. I googled around a bit and I found other people having
>>      >      >>>> the same issue but it wasn't clear in those posts what the cause
>>      >      >>>> was. I did a packet capture just as a sanity check, and here's what
>>      >      >>>> I got:
>>      >      >>>>
>>      >      >>>> POST / HTTP/1.1
>>      >      >>>> ... headers here ...
>>      >      >>>>
>>      >      >>>>
>>      >      >>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
>>      >      >>>>
>>      >      >>>> As you can see, the (url encoded) value of the F956759623048IZR
>>      >      >>>> field (description) is %C3%A7, so something isn't properly
>>      >      >>>> converting that. Helpers.urlDecode seems to be working properly:
>>      >      >>>>
>>      >      >>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
>>      >      >>>> res1: java.lang.String = F956759623048IZR=ç
>>      >      >>>>
>>      >      >>>> So I have no idea where this is coming from. All I know is that
>>      >      >>>> between the actual POST and when my submit function is called,
>>      >      >>>> something is tweaking the string. I'm going to dig some more, but I
>>      >      >>>> wanted to post this in case it triggers any thoughts out there.
>>      >      >>>>
>>      >      >>>> Derek
>>      >      >>>>
>>      >      >>>> PS - I just found this:
>>      >      >>>>
>>      >      >>>>
>>      >      >>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
>>      >      >>>>
>>      >      >>>> May be related?
>>      >      >>>>
>>      >      >>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
>>      >      >>>> <[email protected]> wrote:
>>      >      >>>>
>>      >      >>>>     OK, I can replicate this in our PocketChange app (also going
>>      >      >>>>     against a PostgreSQL DB). Let me dig a bit.
>>      >      >>>>
>>      >      >>>>     Derek
>>      >      >>>>
>>      >      >>>>
>>      >      >>>>     On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
>>      >      >>>>     <[email protected]> wrote:
>>      >      >>>>
>>      >      >>>>
>>      >      >>>>         This might help, but I don't think I was clear. I have an
>>      >      >>>>         online form. My clients enter text into it. Their text has
>>      >      >>>>         characters like a c with a cedilla. That text gets saved into
>>      >      >>>>         a PostgreSQL database (UTF-8) varchar field via JPA/Hibernate.
>>      >      >>>>
>>      >      >>>>         Then I pull it back out and dump it into a template, and it
>>      >      >>>>         comes out gibberish. If I try using &ccedil; instead, I get
>>      >      >>>>         &amp;ccedil; back out.
>>      >      >>>>
>>      >      >>>>         Here is what I have:
>>      >      >>>>
>>      >      >>>>         "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
>>      >      >>>>
>>      >      >>>>         If I enter "cachaça" in the field, I get cachaÃ§a back out.
>>      >      >>>>         The weird thing is that sometimes when I copy and paste text
>>      >      >>>>         from another document into the form, it works. But if I use
>>      >      >>>>         the keyboard, it fails every time.
>>      >      >>>>
>>      >      >>>>         I'll play around with this. Thanks.
>>      >      >>>>
>>      >      >>>>         Chas.
>>      >      >>>>
>>      >      >>>>         Derek Chen-Becker wrote:
>>      >      >>>>         > Oops, forgot scala.xml.Unparsed, too:
>>      >      >>>>         >
>>      >      >>>>         > scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
>>      >      >>>>         > m: scala.xml.Elem = <span>a&ccedil;b</span>
>>      >      >>>>         >
>>      >      >>>>         > That one might be what you're looking for.
>>      >      >>>>         >
>>      >      >>>>         > Derek
>>      >      >>>>         >
>>      >      >>>>         > On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
>>      >      >>>>         > <[email protected]> wrote:
>>      >      >>>>         >
>>      >      >>>>         >     I think it depends on how you're embedding them in the XML:
>>      >      >>>>         >
>>      >      >>>>         >     scala> val m = <span>a&ccedil;b</span>
>>      >      >>>>         >     m: scala.xml.Elem = <span>a&ccedil;b</span>
>>      >      >>>>         >
>>      >      >>>>         >     scala> val m = <span>a{"&ccedil;"}b</span>
>>      >      >>>>         >     m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
>>      >      >>>>         >
>>      >      >>>>         >     scala> val m = <span>a{"ç"}b</span>
>>      >      >>>>         >     m: scala.xml.Elem = <span>açb</span>
>>      >      >>>>         >
>>      >      >>>>         >     That last one was input using dead keys (alt+,) on my Linux
>>      >      >>>>         >     (USA International with dead keys) layout. Let me know if
>>      >      >>>>         >     this doesn't help; if not, could you send the code/template
>>      >      >>>>         >     that's having issues?
>>      >      >>>>         >
>>      >      >>>>         >     Derek
>>      >      >>>>         >
>>      >      >>>>         >
>>      >      >>>>         >     On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
>>      >      >>>>         >     <[email protected]> wrote:
>>      >      >>>>         >
>>      >      >>>>         >
>>      >      >>>>         >         I have a site that uses a lot of "special" characters (a
>>      >      >>>>         >         remarkably biased description, since there is nothing
>>      >      >>>>         >         "special" about accented characters to the people who use
>>      >      >>>>         >         them daily). In particular, I need the c with cedilla and
>>      >      >>>>         >         the n with the tilde.
>>      >      >>>>         >
>>      >      >>>>         >         These characters are being input to a database (UTF-8) via
>>      >      >>>>         >         an online form, then spit back out onto the page.
>>      >      >>>>         >
>>      >      >>>>         >         It's a fucking disaster. Apparently, everything goes through
>>      >      >>>>         >         the xml parser, which is great, except when I try to enter
>>      >      >>>>         >         these as entity references, such as &ccedil;, the parser
>>      >      >>>>         >         changes & to &amp; and I get the literal &ccedil; back out
>>      >      >>>>         >         again.
>>      >      >>>>         >
>>      >      >>>>         >         When I type ç using the keyboard (or copy and paste it from
>>      >      >>>>         >         a page or a text editor), I get gibberish.
>>      >      >>>>         >
>>      >      >>>>         >         Anyone know the trick to getting around this? I need
>>      >      >>>>         >         everything from e acute to e grave to trademark and
>>      >      >>>>         >         registered trademark symbols, and I need to enter them this
>>      >      >>>>         >         way.
>>      >      >>>>         >
>>      >      >>>>         >         Thanks for any help. If I can get this to work, I'll add an
>>      >      >>>>         >         explanation to the wiki.
>>      >      >>>>         >
>>      >      >>>>         >         Chas.

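A note on the byte values reported in this thread. Derek's log line
(c3 83 c2 a7) is consistent with Marc's reading: the UTF-8 bytes for ç
(c3 a7) decoded as Latin-1 give the two characters Ã§, and encoding those
back to UTF-8 gives c3 83 c2 a7. The sketch below reproduces that round
trip using only JDK classes. It is an illustration, not Lift code: the
object name MojibakeDemo and its methods are made up for this example, and
the thread does not establish which component performs the wrong decode.
The repair method is only a plausible cleanup for rows already stored, and
it assumes the wrong decode really was ISO-8859-1 and that no bytes were
lost on the way into the database.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

// Hypothetical helper object, named for this example only.
object MojibakeDemo {
  // Simulate the bug: take the UTF-8 bytes of a string and decode them
  // as Latin-1, so each byte becomes its own character.
  def misdecode(s: String): String =
    new String(s.getBytes(UTF_8), ISO_8859_1)

  // Undo it: map each character back to its Latin-1 byte, then decode
  // the recovered bytes as UTF-8. Assumes every character fits Latin-1.
  def repair(s: String): String =
    new String(s.getBytes(ISO_8859_1), UTF_8)

  // Lower-case hex dump, space separated, in the style of the log line.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  def main(args: Array[String]): Unit = {
    // \u00e7 is c with cedilla; escapes avoid source-encoding surprises.
    val garbled = misdecode("cacha\u00e7a")
    println(garbled)
    println(hex(misdecode("\u00e7").getBytes(UTF_8)))
    // Percent-decoding the POST value with the wrong charset has the
    // same effect as misdecode above.
    println(URLDecoder.decode("%C3%A7", "ISO-8859-1"))
    println(repair(garbled))
  }
}
```

The second line printed is c3 83 c2 a7, the same bytes Derek logged. On a
UTF-8 console the first and last lines show the garbled and the repaired
spelling of the word from the original report.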