On 16/03/2009, at 6:59 AM, Charles F. Munat wrote:

> That was my thinking. It doesn't explain why &ccedil; in gets changed to
> &amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think there
> are two separate issues here.

I tend to agree.

> The ç can be created in two different ways in UTF-8. One is the single
> "c with a cedilla" character. The second is a c character followed by a
> cedilla character. I am not sure how UTF-8 indicates that these two
> characters should be displayed as one.

The two-character "c with a cedilla" sequence is the code points U+0063 U+0327, which is equivalent to U+00E7 (at least optically). The U+0327 is a combining cedilla, seen as a modifier to the preceding 'c' (U+0063) character.

> Neither am I sure that this has anything to do with the problem. Maybe
> it is simply that something is assuming Latin1 input even though the
> input is UTF-8.
>
> It is definitely on the front end, because it is stored in the database
> as Ã§.
>
> When I use &ccedil; instead, the problem is that it is *not* converted
> to ç as it goes into the database, and then on the way out the XML
> interpreter does not recognize it as a character entity reference and so
> converts the & to &amp;.

I think this is due to using the standard Scala XML load functions rather than the Lift XML parser. From memory, I don't think the standard parser recognises that many named entities; i.e., does &#231; work instead of &ccedil;? If so, then that is probably what is happening on this issue.

> Chas.
>
> Marc Boschma wrote:
>> Now I have some breakfast in me, to be clear it appears that the UTF-8
>> byte stream is being interpreted as Latin1 and then converted to
>> unicode...
>>
>> Marc
>>
>> On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
>>
>>> excuse the typo:
>>>
>>> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
>>>
>>>> Just looking at http://jeppesn.dk/utf-8.html , I found the following
>>>> lines:
>>>>
>>>> Character  Latin1  Unicode  UTF-8  Latin1
>>>>            code                    interpr.
>>>> ç          E7      00 E7    C3 A7  Ã§
>>>>
>>>> Ã is C38C, § is C2 A7
>>> Ã is C383
>>>> So it appears that somewhere there is a translation to Latin 1
>>>> going on.
>>>>
>>>> Hopefully that helps somewhat...
>>>> Regards,
>>>> Marc
>>>>
>>>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
>>>>
>>>>> This is really interesting. I've narrowed it down to something on
>>>>> form submission. The database shows gibberish, too, and if I
>>>>> manually enter the correct value in the DB it works fine on display.
>>>>> If I print the UTF-8 byte values of the string I get from the
>>>>> browser for my description when I submit a cedilla (ç), I see:
>>>>>
>>>>> INFO - Submitted desc bytes = c3 83 c2 a7
>>>>>
>>>>> A cedilla is c3 a7 in UTF-8, so I'm not sure where the "83 c2" is
>>>>> coming from. I googled around a bit and I found other people having
>>>>> the same issue but it wasn't clear in those posts what the cause
>>>>> was. I did a packet capture just as a sanity check, and here's what
>>>>> I got:
>>>>>
>>>>> POST / HTTP/1.1
>>>>> ... headers here ...
>>>>>
>>>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
>>>>>
>>>>> As you can see, the (url encoded) value of the F956759623048IZR
>>>>> field (description) is %C3%A7, so something isn't properly
>>>>> converting that. Helpers.urlDecode seems to be working properly:
>>>>>
>>>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
>>>>> res1: java.lang.String = F956759623048IZR=ç
>>>>>
>>>>> So I have no idea where this is coming from. All I know is that
>>>>> between the actual POST and when my submit function is called,
>>>>> something is tweaking the string. I'm going to dig some more, but I
>>>>> wanted to post this in case it triggers any thoughts out there.
>>>>>
>>>>> Derek
>>>>>
>>>>> PS - I just found this:
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
>>>>>
>>>>> May be related?
>>>>>
>>>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
>>>>> <[email protected]> wrote:
>>>>>
>>>>>     OK, I can replicate this in our PocketChange app (also going
>>>>>     against a PostgreSQL DB). Let me dig a bit.
>>>>>
>>>>>     Derek
>>>>>
>>>>>     On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
>>>>>     <[email protected]> wrote:
>>>>>
>>>>>         This might help, but I don't think I was clear. I have an
>>>>>         online form. My clients enter text into it. Their text has
>>>>>         characters like a c with a cedilla. That text gets saved
>>>>>         into a PostgreSQL database (UTF-8) varchar field via
>>>>>         JPA/Hibernate.
>>>>>
>>>>>         Then I pull it back out and dump it into a template, and it
>>>>>         comes out gibberish. If I try using &ccedil; instead, I get
>>>>>         &amp;ccedil; back out.
>>>>>
>>>>>         Here is what I have:
>>>>>
>>>>>         "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
>>>>>
>>>>>         If I enter "cachaça" in the field, I get cachaÃ§a back out.
>>>>>         The weird thing is that sometimes when I copy and paste text
>>>>>         from another document into the form, it works. But if I use
>>>>>         the keyboard, it fails every time.
>>>>>
>>>>>         I'll play around with this. Thanks.
>>>>>
>>>>>         Chas.
>>>>>
>>>>>         Derek Chen-Becker wrote:
>>>>>> Oops, forgot scala.xml.Unparsed, too:
>>>>>>
>>>>>> scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
>>>>>> m: scala.xml.Elem = <span>a&ccedil;b</span>
>>>>>>
>>>>>> That one might be what you're looking for.
>>>>>>
>>>>>> Derek
>>>>>>
>>>>>> On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>     I think it depends on how you're embedding them in the XML:
>>>>>>
>>>>>>     scala> val m = <span>a&ccedil;b</span>
>>>>>>     m: scala.xml.Elem = <span>a&ccedil;b</span>
>>>>>>
>>>>>>     scala> val m = <span>a{"&ccedil;"}b</span>
>>>>>>     m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
>>>>>>
>>>>>>     scala> val m = <span>a{"ç"}b</span>
>>>>>>     m: scala.xml.Elem = <span>açb</span>
>>>>>>
>>>>>>     That last one was input using dead keys (alt+,) on my linux
>>>>>>     (USA International with dead keys) layout. Let me know if this
>>>>>>     doesn't help; if not, could you send the code/template that's
>>>>>>     having issues?
>>>>>>
>>>>>>     Derek
>>>>>>
>>>>>>     On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
>>>>>>     <[email protected]> wrote:
>>>>>>
>>>>>>         I have a site that uses a lot of "special" characters (a
>>>>>>         remarkably biased description, since there is nothing
>>>>>>         "special" about accented characters to the people who use
>>>>>>         them daily). In particular, I need the c with cedilla and
>>>>>>         the n with the tilde.
>>>>>>
>>>>>>         These characters are being input to a database (UTF-8) via
>>>>>>         an online form, then spit back out onto the page.
>>>>>>
>>>>>>         It's a fucking disaster. Apparently, everything goes
>>>>>>         through the xml parser, which is great, except when I try
>>>>>>         to enter these as entity references, such as &ccedil;, the
>>>>>>         parser changes & to &amp; and I get the literal &ccedil;
>>>>>>         back out again.
>>>>>>
>>>>>>         When I type ç using the keyboard (or copy and paste it from
>>>>>>         a page or a text editor), I get gibberish.
>>>>>>
>>>>>>         Anyone know the trick to getting around this?
>>>>>>         I need everything from e acute to e grave to trademark and
>>>>>>         registered trademark symbols, and I need to enter them this
>>>>>>         way.
>>>>>>
>>>>>>         Thanks for any help. If I can get this to work, I'll add an
>>>>>>         explanation to the wiki.
>>>>>>
>>>>>>         Chas.
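
For anyone reproducing the diagnosis in this thread: the bytes Derek logged (c3 83 c2 a7) are what you get if the UTF-8 bytes for ç (c3 a7) are decoded as Latin1 and the resulting two characters are then encoded as UTF-8 again. The Scala sketch below only simulates that mix-up with the standard JDK charsets. It assumes nothing about where in Lift or the container the wrong decode actually happens, and the object and method names are made up for illustration.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

// Hypothetical helper names; nothing here is part of Lift.
object MojibakeSketch {
  // Render bytes as space-separated lowercase hex, e.g. "c3 a7".
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // Simulate the suspected bug: UTF-8 bytes read as if they were Latin1.
  def misdecode(s: String): String =
    new String(s.getBytes(UTF_8), ISO_8859_1)

  // Reverse the mix-up, assuming it happened exactly once.
  def repair(garbled: String): String =
    new String(garbled.getBytes(ISO_8859_1), UTF_8)

  def main(args: Array[String]): Unit = {
    val original = "\u00e7"                  // ç as a single code point
    val garbled  = misdecode(original)       // two characters: U+00C3 U+00A7
    println(hex(original.getBytes(UTF_8)))   // bytes of the correct string
    println(hex(garbled.getBytes(UTF_8)))    // bytes after the wrong decode
    println(repair(garbled) == original)     // round trip restores the input
  }
}
```

The repair step is only safe when the string was mis-decoded exactly once, so it is a diagnostic aid rather than a fix. The fix is to find the component that reads the request as Latin1.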

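
On the separate question of the two ways to write ç (one precomposed code point versus 'c' plus a combining cedilla): the two spellings are different strings with different UTF-8 bytes, and Unicode normalization converts between them. A minimal sketch using the JDK's java.text.Normalizer follows. The object name is hypothetical, and whether normalization matters for the problem in this thread is an open question, as noted by the posters.

```scala
import java.text.Normalizer

// Hypothetical object name, for illustration only.
object CedillaForms {
  val precomposed = "\u00e7"   // LATIN SMALL LETTER C WITH CEDILLA
  val decomposed  = "c\u0327"  // 'c' followed by COMBINING CEDILLA

  // NFC composes base + combining mark into one code point where possible.
  def nfc(s: String): String = Normalizer.normalize(s, Normalizer.Form.NFC)

  // NFD splits precomposed characters into base + combining marks.
  def nfd(s: String): String = Normalizer.normalize(s, Normalizer.Form.NFD)

  def main(args: Array[String]): Unit = {
    println(precomposed == decomposed)        // compared code point by code point
    println(nfc(decomposed) == precomposed)   // compared after composing
    println(nfd(precomposed) == decomposed)   // compared after decomposing
    println(precomposed.getBytes("UTF-8").length)
    println(decomposed.getBytes("UTF-8").length)
  }
}
```

Normalizing form input to NFC before storing it is one way to make the two spellings compare equal, if that ever turns out to be needed.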