Re: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)

2008-09-28 Thread J. Shirley
On Sat, Sep 27, 2008 at 3:39 PM, Darren Duncan [EMAIL PROTECTED] wrote:
 Maybe you're already aware of this, but I've found from experience that
 troubleshooting encoding/Unicode problems in a web/db app can be difficult,
 especially with multiple conversions at different stages, but I've come up
 with a short generic algorithm to help test/ensure that things are working
 and where things need fixing.  Note that these details assume we're using
 Perl 5.8+.
  [ snip ]

Hey Darren, great post!

Can you post it on the wiki, perhaps linked from

http://dev.catalystframework.org/wiki/faq

as "Unicode Troubleshooting" in the Unicode section there?  It would be
much appreciated.

Thanks,
-J

___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/


Re: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)

2008-09-28 Thread Lee Aylward
On Sat, Sep 27, 2008 at 03:39:03PM -0700, Darren Duncan wrote:
 Maybe you're already aware of this, but I've found from experience that  
 troubleshooting encoding/Unicode problems in a web/db app can be 
 difficult, especially with multiple conversions at different stages, but 
 I've come up with a short generic algorithm to help test/ensure that 
 things are working and where things need fixing.  Note that these details 
 assume we're using Perl 5.8+.

 ... lots of good tips...

Great timing on this as I am currently struggling with some unicode text
not displaying correctly in an application I am working on. Per your
suggestion I put the Japanese text at the top of my template. All of a
sudden the browsers started displaying that and other non-ascii characters
correctly. The second I take away the Japanese text it goes back to just
showing question marks. I am seeing this behavior in both the test
server and Apache.

I have looked at the Content-Type header and it is definitely serving it
as utf-8, so I am at a bit of a loss. There are no databases involved
here, but I am displaying information from IMDB::Film. Is there anything
in the actual HTML that needs to be set?

Thanks for any thoughts on this.
-- 
Lee Aylward




Re: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)

2008-09-28 Thread Lee Aylward
On Sun, Sep 28, 2008 at 08:41:09PM -0500, Lee Aylward wrote:
 
 Great timing on this as I am currently struggling with some unicode text
 not displaying correctly in an application I am working on. Per your
 suggestion I put the Japanese text at the top of my template. All of a
 sudden the browsers started displaying that and other non-ascii characters
 correctly. The second I take away the Japanese text it goes back to just
 showing question marks. I am seeing this behavior in both the test
 server and Apache.
 
 I have looked at the Content-Type header and it is definitely serving it
 as utf-8, so I am at a bit of a loss. There are no databases involved
 here, but I am displaying information from IMDB::Film. Is there anything
 in the actual HTML that needs to be set?
 

A little more info. I checked my page on the w3 validator and it
returned this:

 Sorry, I am unable to validate this document because on line 245  it
 contained one or more bytes that I cannot interpret as utf-8  (in other
 words, the bytes found are not valid values in the specified Character
 Encoding). Please check both the content of the file and the character
 encoding indication.

 The error was: utf8 \xE9 does not map to Unicode 

 
http://validator.w3.org/check?uri=http%3A%2F%2Fprettybrd.com%2Ffilm%2Fperson%2F0001951&charset=%28detect+automatically%29&doctype=Inline&group=0

Perhaps my strings are getting encoded twice? I'll add the suggestion of
trying the validator to the wiki page, but it would be nice to have a
solution to this specific problem on there as well.
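
The validator's complaint about a lone \xE9 is what a strict UTF-8 decoder
reports when it meets the single byte that Latin-1 uses for e-acute. The same
check can be run locally with Encode, as a minimal sketch. The helper name
is_valid_utf8 and the sample strings are made up for illustration; they are
not taken from the application in question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Returns 1 if $bytes is well-formed UTF-8, 0 otherwise.
# is_valid_utf8 is a hypothetical helper name, not part of Encode.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $latin1 = "Ren\xE9e";        # e-acute as the single Latin-1 byte E9
my $utf8   = "Ren\xC3\xA9e";    # e-acute as the two UTF-8 bytes C3 A9

printf "Latin-1 bytes valid as UTF-8? %s\n", is_valid_utf8($latin1) ? 'yes' : 'no';
printf "UTF-8 bytes valid as UTF-8?   %s\n", is_valid_utf8($utf8)   ? 'yes' : 'no';
```

If the first check fails on text coming out of a module, that text is probably
Latin-1 bytes (or a Perl string being written without an encoding layer) rather
than doubly encoded UTF-8.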

-- 
Lee Aylward




Re: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)

2008-09-28 Thread Darren Duncan

Lee Aylward wrote:

Great timing on this as I am currently struggling with some unicode text
not displaying correctly in an application I am working on. Per your
suggestion I put the Japanese text at the top of my template. All of a
sudden the browsers started displaying that and other non-ascii characters
correctly. The second I take away the Japanese text it goes back to just
showing question marks. I am seeing this behavior in both the test
server and Apache.

I have looked at the Content-Type header and it is definitely serving it
as utf-8, so I am at a bit of a loss. There are no databases involved
here, but I am displaying information from IMDB::Film. Is there anything
in the actual HTML that needs to be set?


That seems strange.  I wonder if something in your template handler or 
other part of your app is trying to DWIM for you and is getting it wrong. 
Are your source files actually UTF-8, both the prior and new versions?  Are 
you explicitly declaring that in one place and not another?  I wouldn't 
expect the addition of Japanese text to suddenly make the other characters 
look correct by itself unless there's some DWIM going on.  I suspect you 
made some other change between the two versions as well, such as saving the 
source file in a different encoding.


Note that the reason I use a Japanese text example is because the vast 
majority of my normal program text would fit in the ASCII repertoire, and 
it would only be user data that might be Unicode, though most user data 
isn't.  And Japanese characters are known to not have a one-byte 
interpretation and they stand out clearly from latin letters at a glance. 
So in your own situation, the text you already have that doesn't display 
right, if it is literal text in your source code, should be a surrogate for 
my Japanese test example to see if things look right.  So see what your 
text editor says that your older/incorrect file version's encoding is.
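
One Perl behaviour that can produce this kind of symptom, offered as a
possibility and not as a diagnosis of the application above: when a string is
printed to a filehandle that has no encoding layer, Perl writes Latin-1 bytes
if every character is below U+0100, but writes UTF-8 bytes (with a "Wide
character" warning) as soon as the string contains any character above U+00FF.
The sketch below shows this with an in-memory filehandle; bytes_written and
hex_dump are hypothetical helper names.

```perl
use strict;
use warnings;

# Print $string to an in-memory filehandle that has no encoding layer
# and return the raw bytes that were written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # silence "Wide character in print"
        print {$fh} $string;
    }
    close $fh;
    return $buffer;
}

# Render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $latin_only = "caf\x{E9}";             # every character is below U+0100
my $with_kana  = "caf\x{E9} \x{30B5}";    # also contains U+30B5 (katakana SA)

print hex_dump(bytes_written($latin_only)), "\n";
print hex_dump(bytes_written($with_kana)),  "\n";
```

In the first dump the e-acute is the single byte E9; in the second it is C3 A9.
Putting an explicit encoding layer on the output handle removes the difference.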


-- Darren Duncan



[Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)

2008-09-27 Thread Darren Duncan
Maybe you're already aware of this, but I've found from experience that 
troubleshooting encoding/Unicode problems in a web/db app can be difficult, 
especially with multiple conversions at different stages, but I've come up 
with a short generic algorithm to help test/ensure that things are working 
and where things need fixing.  Note that these details assume we're using 
Perl 5.8+.


1. Make sure all your text/code/template/non-binary/etc files are saved as 
UTF-8 text files (or they are 7-bit ASCII), and you have a Unicode-savvy 
text editor.


2. Put a "use utf8;" line at the top of every Perl file, so Perl knows your 
source files are encoded as UTF-8.


3. Place a text string literal in your program code that you know isn't in 
ASCII ... for example I like to use the word 'サンプル', which is what came 
out of Google's translation tool when I asked it to translate the word 
'sample' to Japanese.  Then set up your program to display that text 
directly in your web page text, without any escaping.


4. Make sure the HTTP response headers for the webpage with that text have 
a content-type charset value of UTF-8, and make sure that Perl is encoding 
its output as actual UTF-8; if you were doing it directly using STDOUT for 
example such as in a CGI, it could be: binmode *main::STDOUT, 
':encoding(UTF-8)'; or such.  Make sure your web browser is Unicode savvy.


5. At this point, if the web page displays correctly with the non-ASCII 
literal (and moreover, if you view source in the browser and the literal 
also displays literally), then you know your program can work/represent 
internally with Unicode correctly, and it can output Unicode correctly to 
the browser.  It is very important to get this step working first, in 
isolation, so that you are in a position to judge or troubleshoot other 
issues such as receiving Unicode input from a browser or using it with a 
database.


6. Next test that you can receive Unicode from the browser in the various 
ways, whether by query string / http headers or in an http post.  Eg try 
outputting a value and have the user submit it again, and compare for 
equality either in the Perl program or by displaying it again next to the 
original for visual inspection.  If any differences come up, then you know 
any fixes you have to do concern either how you read and interpret the 
browser request, or perhaps on how you instruct the browser on how to 
submit a request.  Once that's all cleared up, then you know your I/O with 
the web browser works fine.


7. To test a database, I suggest first using a known-good and Unicode savvy 
alternate input method for putting some Unicode text in the database, such 
as using an admin/utility tool that came with the DBMS.  Also make sure 
that the database is itself using UTF-8 character strings in its schema, eg 
that the schema is declared this way.


8. With a database known to contain some valid Unicode etc text, you first 
test simply selecting that text from the database and displaying it.  If 
anything doesn't match, it means you probably have to configure your DBMS 
client connection encoding so it is UTF-8 (often done with a few certain 
SQL commands), and then separately ensure that Perl is decoding the UTF-8 
data into Perl text strings properly.  It's important to make sure you can 
retrieve Unicode from the database properly so that you have a context for 
judging that you can insert such text in the database.


9. Next try to insert some Unicode text in the database using your program, 
then select it back to check that it worked.  If it didn't, then check DBMS 
client connection settings, or that Perl is encoding text as UTF-8 properly.


10. Actually, when you have a known-good external tool to help you, you can 
alternately start the DBMS tests with step 9, where your program inserts 
text, then you use the known-good tool to ensure it actually was recorded 
properly.
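
Steps 2 to 4 above can be sketched as a tiny CGI-style script. This is only an
illustration under the assumption that output goes straight to STDOUT; in a
framework the header and encoding would be set through its own API. The file
must itself be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                    # step 2: this source file is UTF-8
use Encode qw(encode);

my $sample = 'サンプル';      # step 3: a literal that is not in ASCII

# Step 4: encode everything written to STDOUT as UTF-8, and say so in the header.
binmode STDOUT, ':encoding(UTF-8)';
print "Content-Type: text/html; charset=UTF-8\r\n\r\n";
print "<p>$sample</p>\n";

# Sanity check: with "use utf8" the literal is counted in characters, and
# the UTF-8 encoding of it is longer than that.
my $chars = length $sample;
my $bytes = length encode('UTF-8', $sample);
print "<!-- $chars characters, $bytes bytes as UTF-8 -->\n";
```

If the character count equals the byte count, Perl is treating the literal as
raw bytes, which usually means "use utf8;" is missing.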


Anyway, that's it in a nutshell.  Now I'm sure many of you have already 
figured this out, but for those who haven't, I hope these tips help you. 
Adjust as appropriate to account for any abstraction tools or frameworks 
you are using which means your tests may also involve testing those tools 
or configuring them.
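
The equality comparison in step 6 can be sketched without a browser by
simulating what travels over the wire: a UTF-8 page makes the browser send the
value as percent-encoded UTF-8 bytes, and the program has to undo both layers.
The helper names to_query_value and from_query_value are hypothetical; a real
application would normally rely on its framework or a URI module for this.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Character string -> percent-encoded UTF-8, as a browser would submit it.
sub to_query_value {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

# Percent-encoded UTF-8 -> character string, as the program should read it.
sub from_query_value {
    my ($raw) = @_;
    (my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return decode('UTF-8', $bytes);
}

my $original   = 'サンプル';
my $on_wire    = to_query_value($original);
my $round_trip = from_query_value($on_wire);

print "on the wire: $on_wire\n";
print 'round trip ', ($round_trip eq $original ? 'matches' : 'differs'), "\n";
```

If the decode step is skipped, the comparison fails even though the bytes are
correct, which points at how the request is read and not at the browser.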


-- Darren Duncan

Hugh Hunter wrote:
I've been struggling with this for some time and know there must be an 
answer out there.


I'm using URL arguments to pass parameters to my controller.  It's a 
site about names, so take the url http://domain.com/name/Jesús (note the 
accented u).  The Name.pm controller has an :Args(1) decorator so Jesús 
is stored in $name and then passed to my DBIC model in a 
->search({name => $name}) call.  This doesn't manage to find the row that exists in 
mysql.  When I dump $name I get:


'name' => 'Jes\xc3\xbas'

which I think I understand as being perl's internal escaping of utf-8 
characters.
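
For reference, the dumped value is consistent with a string of six UTF-8 bytes
that has not yet been decoded into five characters. A minimal sketch of the
difference, using only the dumped value from above; whether this is what makes
the search fail also depends on how the database connection is configured,
which this sketch does not cover.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $from_url = "Jes\xc3\xbas";    # the dumped value: six bytes
my $expected = "Jes\x{FA}s";      # five characters; U+00FA is u-acute

my $decoded = decode('UTF-8', $from_url);

printf "raw length %d, decoded length %d\n", length $from_url, length $decoded;
printf "raw equals expected:     %s\n", $from_url eq $expected ? 'yes' : 'no';
printf "decoded equals expected: %s\n", $decoded  eq $expected ? 'yes' : 'no';
```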


I've done everything recommended on 
http://dev.catalystframework.org/wiki/gettingstarted/tutorialsandhowtos/using_unicode and