Re: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)
On Sat, Sep 27, 2008 at 3:39 PM, Darren Duncan [EMAIL PROTECTED] wrote:
> Maybe you're already aware of this, but I've found from experience that troubleshooting encoding/Unicode problems in a web/db app can be difficult, especially with multiple conversions at different stages, but I've come up with a short generic algorithm to help test/ensure that things are working and where things need fixing. Note that these details assume we're using Perl 5.8+.

[ snip ]

Hey Darren, great post! Can you post it on the wiki, perhaps at:

http://dev.catalystframework.org/wiki/faq

with a link to "Unicode Troubleshooting" in the Unicode section there? It would be much appreciated.

Thanks,
-J
___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/
Re: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)
On Sat, Sep 27, 2008 at 03:39:03PM -0700, Darren Duncan wrote:
> Maybe you're already aware of this, but I've found from experience that troubleshooting encoding/Unicode problems in a web/db app can be difficult, especially with multiple conversions at different stages, but I've come up with a short generic algorithm to help test/ensure that things are working and where things need fixing. Note that these details assume we're using Perl 5.8+.
> ... lots of good tips...

Great timing on this, as I am currently struggling with some Unicode text not displaying correctly in an application I am working on.

Per your suggestion I put the Japanese text at the top of my template. All of a sudden the browsers started displaying that and other non-ASCII characters correctly. The second I take away the Japanese text, it goes back to just showing question marks. I am seeing this behavior with both the test server and Apache. I have looked at the Content-Type header and it is definitely serving the page as utf-8, so I am at a bit of a loss.

There are no databases involved here, but I am displaying information from IMDB::Film. Is there anything in the actual HTML that needs to be set?

Thanks for any thoughts on this.

-- 
Lee Aylward
Re: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)
On Sun, Sep 28, 2008 at 08:41:09PM -0500, Lee Aylward wrote:
> Great timing on this as I am currently struggling with some unicode text not displaying correctly in an application I am working on. Per your suggestion I put the Japanese text at the top of my template. All of a sudden the browsers started displaying that and other non-ascii characters correctly. The second I take away the Japanese text it goes back to just showing question marks. I am seeing this behavior in both the test server and Apache. I have looked at the Content-Type header and it is definitely serving it as utf-8, so I am at a bit of a loss. There are no databases involved here, but I am displaying information from IMDB::Film. Is there anything in the actual HTML that needs to be set?

A little more info. I checked my page with the W3C validator and it returned this:

    Sorry, I am unable to validate this document because on line 245 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication. The error was: utf8 \xE9 does not map to Unicode

http://validator.w3.org/check?uri=http%3A%2F%2Fprettybrd.com%2Ffilm%2Fperson%2F0001951&charset=%28detect+automatically%29&doctype=Inline&group=0

Perhaps my strings are getting encoded twice? I'll add the suggestion of trying the validator to the wiki page, but it would be nice to have a solution to this specific problem on there as well.

-- 
Lee Aylward
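A note on the validator message above: a lone \xE9 byte is how é is written in ISO-8859-1, whereas UTF-8 writes it as the two bytes \xC3 \xA9. The sketch below compares the three cases (never encoded, encoded once, encoded twice) using the core Encode module. The helper names hex_bytes and is_valid_utf8 and the sample name are made up for illustration, and nothing in the thread confirms which case applies to the page in question; it is only meant to show that double-encoded text would still pass as valid UTF-8, while un-encoded Latin-1 text would not.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# Render a byte string as space-separated hex, e.g. "C3 A9".
# (Hypothetical helper, for display only.)
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# True if the byte string decodes cleanly as strict UTF-8.
# (Hypothetical helper.)
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK flag may modify its input
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

my $char    = "\x{E9}";                       # é as a one-character Perl string
my $latin1  = encode('ISO-8859-1', $char);    # never encoded as UTF-8
my $utf8    = encode('UTF-8', $char);         # encoded once (correct)
my $doubled = encode('UTF-8', $utf8);         # encoded twice

for my $case ([latin1 => $latin1], [utf8 => $utf8], [doubled => $doubled]) {
    my ($label, $bytes) = @$case;
    # Embed the bytes in a made-up name so the check sees them in context.
    printf "%-8s %-12s %s\n", $label, hex_bytes($bytes),
        is_valid_utf8("Ren${bytes}e") ? 'valid UTF-8' : 'NOT valid UTF-8';
}
```

The byte sequences come out as E9, C3 A9 and C3 83 C2 A9 respectively. Only the first one fails the UTF-8 check, so a complaint about \xE9 is at least consistent with text that was sent without being encoded as UTF-8 at all.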
Re: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)
Lee Aylward wrote:
> Great timing on this as I am currently struggling with some unicode text not displaying correctly in an application I am working on. Per your suggestion I put the Japanese text at the top of my template. All of a sudden the browsers started displaying that and other non-ascii characters correctly. The second I take away the Japanese text it goes back to just showing question marks. I am seeing this behavior in both the test server and Apache. I have looked at the Content-Type header and it is definitely serving it as utf-8, so I am at a bit of a loss. There are no databases involved here, but I am displaying information from IMDB::Film. Is there anything in the actual HTML that needs to be set?

That seems strange. I wonder if something in your template handler or another part of your app is trying to DWIM for you and is getting it wrong.

Are your source files actually UTF-8, both the prior and new versions? Are you explicitly declaring that in one place and not another?

I wouldn't expect the addition of Japanese text to suddenly make the other characters look correct by itself unless there's some DWIM going on. I suspect you made some other change between the two versions as well, such as saving the source file in a different encoding.

Note that the reason I use a Japanese text example is that the vast majority of my normal program text would fit in the ASCII repertoire, and it would only be user data that might be Unicode, though most user data isn't. Japanese characters are also known not to have a one-byte interpretation, and they stand out clearly from Latin letters at a glance.

So in your own situation, the text you already have that doesn't display right, if it is literal text in your source code, can stand in for my Japanese test example to see if things look right.

So see what your text editor says the encoding of your older/incorrect file version is.

-- 
Darren Duncan
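If the text editor does not report an encoding, the same question can be asked of the raw bytes. What follows is a rough sketch only, assuming the core Encode module; classify_bytes and classify_file are hypothetical names, and the 'other' result just means "not valid UTF-8" (it could be ISO-8859-1, Windows-1252 or anything else).

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Classify raw bytes as 'ascii', 'utf8' or 'other'. (Hypothetical helper.)
sub classify_bytes {
    my ($bytes) = @_;
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;    # nothing above 7-bit ASCII
    my $copy = $bytes;    # decode() with a CHECK flag may modify its input
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 'utf8' : 'other';
}

# Read a file as raw bytes (no decoding layer) and classify it.
sub classify_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;             # slurp the whole file
    my $bytes = <$fh>;
    close $fh;
    return classify_bytes(defined $bytes ? $bytes : '');
}

# Usage: perl check_encoding.pl old_template.tt new_template.tt
for my $path (@ARGV) {
    printf "%-6s %s\n", classify_file($path), $path;
}
```

Running it over the older and newer versions of a template shows whether the two files were saved in different encodings.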
[Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)
Maybe you're already aware of this, but I've found from experience that troubleshooting encoding/Unicode problems in a web/db app can be difficult, especially with multiple conversions at different stages. I've come up with a short generic algorithm to help test/ensure that things are working and to find where things need fixing. Note that these details assume we're using Perl 5.8+.

1. Make sure all your text/code/template/non-binary/etc files are saved as UTF-8 text files (or they are 7-bit ASCII), and you have a Unicode-savvy text editor.

2. Have a "use utf8;" at the top of every Perl file, so Perl treats your source files as being Unicode.

3. Place a text string literal in your program code that you know isn't in ASCII ... for example I like to use the word 'サンプル', which is what came out of Google's translation tool when I asked it to translate the word 'sample' to Japanese. Then set up your program to display that text directly in your web page text, without any escaping.

4. Make sure the HTTP response headers for the webpage with that text have a content-type charset value of UTF-8, and make sure that Perl is encoding its output as actual UTF-8; if you were doing it directly using STDOUT, for example in a CGI, it could be:

    binmode *main::STDOUT, ':encoding(UTF-8)';

or such. Make sure your web browser is Unicode savvy.

5. At this point, if the web page displays correctly with the non-ASCII literal (and moreover, if you view source in the browser and the literal also displays literally), then you know your program can work with and represent Unicode correctly internally, and it can output Unicode correctly to the browser. It is very important to get this step working first, in isolation, so that you are in a position to judge or troubleshoot other issues such as receiving Unicode input from a browser or using it with a database.

6. Next, test that you can receive Unicode from the browser in the various ways, whether by query string / HTTP headers or in an HTTP POST. E.g. try outputting a value and have the user submit it again, and compare for equality either in the Perl program or by displaying it again next to the original for visual inspection. If any differences come up, then you know any fixes you have to do concern either how you read and interpret the browser request, or perhaps how you instruct the browser to submit a request. Once that's all cleared up, then you know your I/O with the web browser works fine.

7. To test a database, I suggest first using a known-good and Unicode-savvy alternate input method for putting some Unicode text in the database, such as an admin/utility tool that came with the DBMS. Also make sure that the database is itself using UTF-8 character strings in its schema, e.g. that the schema is declared this way.

8. With a database known to contain some valid Unicode text, first test simply selecting that text from the database and displaying it. If anything doesn't match, it means you probably have to configure your DBMS client connection encoding so it is UTF-8 (often done with a few SQL commands), and then separately ensure that Perl is decoding the UTF-8 data into Perl text strings properly. It's important to make sure you can retrieve Unicode from the database properly so that you have a context for judging whether you can insert such text in the database.

9. Next try to insert some Unicode text in the database using your program, then select it back to check that it worked. If it didn't, then check the DBMS client connection settings, or that Perl is encoding text as UTF-8 properly.

10. Actually, when you have a known-good external tool to help you, you can alternately start the DBMS tests with step 9, where your program inserts text, and then use the known-good tool to ensure it actually was recorded properly.
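For steps 2 through 6, a bare CGI-style script along the following lines could serve as the isolated first test. Treat it as a sketch under assumptions rather than a recipe: it talks to STDOUT directly instead of going through Catalyst, the names build_response and decode_param are invented for this example, and the percent-decoding is done by hand purely to make the bytes-to-characters step visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # step 2: this source file is UTF-8
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';    # step 4: encode output as UTF-8

my $sample = 'サンプル';                # step 3: a literal that is not ASCII

# Step 4: headers with an explicit charset, then the literal unescaped.
# (Hypothetical helper.)
sub build_response {
    my ($text) = @_;
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n"
         . "<html><head><title>Unicode test</title></head>"
         . "<body><p>$text</p></body></html>\n";
}

# Step 6: turn a percent-encoded request value into a Perl character string.
# The unescaped result is UTF-8 *bytes*; decode() makes characters of them.
# (Hypothetical helper.)
sub decode_param {
    my ($raw) = @_;
    (my $bytes = $raw) =~ tr/+/ /;
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

print build_response($sample);

# Round trip: what a browser would send back for the sample word.
my $submitted = '%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AB';
warn "round trip failed\n" unless decode_param($submitted) eq $sample;
```

With "use utf8" in effect, length($sample) is 4 characters; without it the same literal would be seen as 12 bytes, which is a quick way to tell whether step 2 took effect.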
Anyway, that's it in a nutshell.

Now I'm sure many of you have already figured this out, but for those who haven't, I hope these tips help you.

Adjust as appropriate to account for any abstraction tools or frameworks you are using, which means your tests may also involve testing those tools or configuring them.

-- 
Darren Duncan

Hugh Hunter wrote:
> I've been struggling with this for some time and know there must be an answer out there. I'm using URL arguments to pass parameters to my controller. It's a site about names, so take the url http://domain.com/name/Jesús (note the accented u). The Name.pm controller has an :Args(1) decorator so Jesús is stored in $name and then passed to my DBIC model in a ->search({name => $name}) call. This doesn't manage to find the row that exists in MySQL. When I dump $name I get: 'name' => 'Jes\xc3\xbas' which I think I understand as being perl's internal escaping of utf-8 characters. I've done everything recommended on http://dev.catalystframework.org/wiki/gettingstarted/tutorialsandhowtos/using_unicode and