Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding
On Wed, Mar 19, 2008 at 12:36:42AM -0500, Jonathan Rockway wrote: * On Wed, Mar 19 2008, Jonathan Rockway wrote: We should not need to check the flag. The incoming data should be encoded. Then we should decode it. Then we should not try to decode it again. A key thing I forgot to mention is that is_utf8 doesn't mean we tried to decode this already. It means that the internal representation of the string is utf8. Please don't say that. ;) I think the fact that people know that Perl's *internal* encoding is utf8 causes a lot of confusion. I wish is_utf8 was aliased as is_decoded or is_characters and all we knew was that Perl uses some kind of hidden internal character representation. This is an important plugin, and it is often cited as something to use. I'd argue that it should be part of Catalyst core and not a plugin -- when do we not work with octets externally and characters internally? I think there's a number of problems with the plugin: To be clear, the problem I posted about was the plugin *decoding* already decoded content. If Perl sees the utf8 flag is already set it will die with: Cannot decode string with wide characters at ... But, you ask, why would the plugin see already decoded content? This can happen if the body was parsed with, for example, Expat or XML::LibXML -- or there's some other custom body parser that needs to work in decoded characters before this plugin runs. So, it's very possible that the plugin would see already decoded content and thus should not attempt to decode again. Unless someone does something stupid and forces the utf8 flag true, the utf8 flag indicates that the content has been decoded. That's why it makes perfect sense to test it before decoding to avoid the error. True, the utf8 flag does not mean we have non-ascii or wide characters. $ perl -MEncode -wle \ 'print utf8::is_utf8( Encode::decode_utf8( foo )) ? Is utf8\n : Is not utf8\n' Is utf8 But it does mean we don't have to call decode again. Even if decoding ascii-only chars as utf8 did NOT set the utf8 flag true then decoding as utf8 again is not a problem (thanks to ascii mapping to utf8). Testing is the utf8 flag before decoding is the fix. Here's another thing this plugin does incorrectly. The current code is: sub prepare_parameters { my $c = shift; $c-NEXT::prepare_parameters; for my $value ( values %{ $c-request-{parameters} } ) { [...] It should be decoding *body_parameters*. In other words, the result of that code above is: length $c-req-parameters-{foo} != length $c-req-body_parameters-{foo}. (or one parameter is utf8-flagged characters and one is a byte string). So, the fix is to override prepare_body_parameters to decode the body params then $c-engine-prepare_parameter will merge the decoded body params with the query params: sub prepare_body_parameters { my $c = shift; # Read in the body params via the engine $c-NEXT::prepare_body_parameters; for my $value ( values %{ $c-request-body_parameters } ) { # This breaks decoding if a param is a hash # and Data::Visitor will return hash *keys* so that's no help next if ref $value ref $value ne 'ARRAY'; for ( ref( $value ) ? @{$value} : $value ) { # Don't decode if already decoded. next if Encode::is_utf8( $_ ); $_ = decode_utf8( $_, $CHECK ); } } return; } Now, the query parameters need fixing in the same way. Where the code to decode body and query parameters should go is debatable. In some ways I think it should be in the Engine as that's what is bringing the data into the application. Maybe the decoding could happen in Engine's prepare_parameters as it's looking at both the body and query params -- but that runs the risk of something using the query params before they are decoded. Finally, the plugin does this: unless ( $c-response-content_type =~ /^text|xml$|javascript$/ ) { return $c-NEXT::finalize; } unless ( Encode::is_utf8( $c-response-body ) ) { return $c-NEXT::finalize; } $c-response-body( $c-encoding-encode( $c-response-body, $CHECK ) ); That's a bit less clear if that's correct or not. The is_utf8 does serve a purpose, but I think it's implemented wrong. Here's an example where that might be a fix: Your normal encoding is utf8, but you also return XML and you have a client that, for some reason, insists on a different encoding. So you encode the body (clearing the utf8 flag) before the plugin sees the body. So the test of is_utf8 above will keep from encoding the already-encoded byte string. Of course, what should happen is the output encoding should be determined by looking at the Accept-Charset and maybe send a 406 if can't find a suitable encoding. -- Bill Moseley [EMAIL PROTECTED] ___ List:
Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding
* On Tue, Mar 18 2008, Bill Moseley wrote: The plugin decodes all parameters using: $_ = $c-encoding-decode( $_, $CHECK ) for ( ref($value) ? @{$value} : $value ); I'd think it would be wise to check to see if the string is already decoded: for ( ref($value) ? @{$value} : $value ) { next if Encode::is_utf8($_); $_ = $c-encoding-decode( $_, $CHECK ); } Never check is_utf8. Encode will do the right thing regardless of the internal representation of the string. http://blog.jrock.us/articles/Fuck%20the%20internal%20representation.pod Regards, Jonathan Rockway -- print just = another = perl = hacker = if $,=$ ___ List: Catalyst@lists.scsys.co.uk Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/ Dev site: http://dev.catalyst.perl.org/
Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding
On Tue, Mar 18, 2008 at 03:38:06AM -0500, Jonathan Rockway wrote: * On Tue, Mar 18 2008, Bill Moseley wrote: The plugin decodes all parameters using: $_ = $c-encoding-decode( $_, $CHECK ) for ( ref($value) ? @{$value} : $value ); I'd think it would be wise to check to see if the string is already decoded: for ( ref($value) ? @{$value} : $value ) { next if Encode::is_utf8($_); $_ = $c-encoding-decode( $_, $CHECK ); } Never check is_utf8. Encode will do the right thing regardless of the internal representation of the string. This is decoding, not encoding. Without testing is_utf8 and if the parameters are already decoded I rightly get: [error] Caught exception in engine Cannot decode string with wide characters at /usr/local/share/perl/5.8.8/Catalyst/Plugin/Unicode/Encoding.pm line 74. http://blog.jrock.us/articles/Fuck%20the%20internal%20representation.pod Yes, checking is_ut8 before encoding doesn't make sense. -- Bill Moseley [EMAIL PROTECTED] ___ List: Catalyst@lists.scsys.co.uk Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/ Dev site: http://dev.catalyst.perl.org/
Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding
On Tue, Mar 18, 2008 at 11:01 AM, Bill Moseley [EMAIL PROTECTED] wrote: On Tue, Mar 18, 2008 at 03:38:06AM -0500, Jonathan Rockway wrote: * On Tue, Mar 18 2008, Bill Moseley wrote: The plugin decodes all parameters using: $_ = $c-encoding-decode( $_, $CHECK ) for ( ref($value) ? @{$value} : $value ); I'd think it would be wise to check to see if the string is already decoded: for ( ref($value) ? @{$value} : $value ) { next if Encode::is_utf8($_); $_ = $c-encoding-decode( $_, $CHECK ); } Never check is_utf8. Encode will do the right thing regardless of the internal representation of the string. This is decoding, not encoding. Without testing is_utf8 and if the parameters are already decoded I rightly get: [error] Caught exception in engine Cannot decode string with wide characters at /usr/local/share/perl/5.8.8/Catalyst/Plugin/Unicode/Encoding.pm line 74. http://blog.jrock.us/articles/Fuck%20the%20internal%20representation.pod Yes, checking is_ut8 before encoding doesn't make sense. No, you never should check the flag. See Miyagawa's post http://use.perl.org/~miyagawa/journal/35700. -- Jonas ___ List: Catalyst@lists.scsys.co.uk Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/ Dev site: http://dev.catalyst.perl.org/
Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding
On Wed, Mar 19, 2008 at 12:04:37AM +, Jonas Alves wrote: No, you never should check the flag. See Miyagawa's post http://use.perl.org/~miyagawa/journal/35700. You quoted the entire message -- what point where you commenting on. -- Bill Moseley [EMAIL PROTECTED] ___ List: Catalyst@lists.scsys.co.uk Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/ Dev site: http://dev.catalyst.perl.org/
Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding
* On Tue, Mar 18 2008, Bill Moseley wrote: Without testing is_utf8 and if the parameters are already decoded I rightly get: [error] Caught exception in engine Cannot decode string with wide characters at /usr/local/share/perl/5.8.8/Catalyst/Plugin/Unicode/Encoding.pm line 74. What this means is we don't know what the fuck the encoding is in the incoming data, so just try random shit until something appears to work. I would prefer a more thought-out approach. We should not need to check the flag. The incoming data should be encoded. Then we should decode it. Then we should not try to decode it again. Regards, Jonathan Rockway -- print just = another = perl = hacker = if $,=$ ___ List: Catalyst@lists.scsys.co.uk Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/ Dev site: http://dev.catalyst.perl.org/
Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding
* On Wed, Mar 19 2008, Jonathan Rockway wrote: We should not need to check the flag. The incoming data should be encoded. Then we should decode it. Then we should not try to decode it again. A key thing I forgot to mention is that is_utf8 doesn't mean we tried to decode this already. It means that the internal representation of the string is utf8. Regards, Jonathan Rockway -- print just = another = perl = hacker = if $,=$ ___ List: Catalyst@lists.scsys.co.uk Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/ Dev site: http://dev.catalyst.perl.org/
[Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding
The plugin decodes all parameters using: $_ = $c-encoding-decode( $_, $CHECK ) for ( ref($value) ? @{$value} : $value ); I'd think it would be wise to check to see if the string is already decoded: for ( ref($value) ? @{$value} : $value ) { next if Encode::is_utf8($_); $_ = $c-encoding-decode( $_, $CHECK ); } See any reason not to do that? -- Bill Moseley [EMAIL PROTECTED] ___ List: Catalyst@lists.scsys.co.uk Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/ Dev site: http://dev.catalyst.perl.org/