subject:"\[Catalyst\] Patch for Catalyst\:\:Plugin\:\:Unicode\:\:Encoding"

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

2008-03-19 Thread Bill Moseley

On Wed, Mar 19, 2008 at 12:36:42AM -0500, Jonathan Rockway wrote:
 * On Wed, Mar 19 2008, Jonathan Rockway wrote:
 
  We should not need to check the flag.  The incoming data should be
  encoded.  Then we should decode it.  Then we should not try to decode it
  again.
 
 A key thing I forgot to mention is that is_utf8 doesn't mean we tried
 to decode this already.  It means that the internal representation of
 the string is utf8.

Please don't say that. ;)

I think the fact that people know that Perl's *internal* encoding is
utf8 causes a lot of confusion.  I wish is_utf8 was aliased as
is_decoded or is_characters and all we knew was that Perl uses
some kind of hidden internal character representation.


This is an important plugin, and it is often cited as something to
use.  I'd argue that it should be part of Catalyst core and not a
plugin -- when do we not work with octets externally and characters
internally?

I think there's a number of problems with the plugin:


To be clear, the problem I posted about was the plugin *decoding*
already decoded content.  If Perl sees the utf8 flag is already
set it will die with:

Cannot decode string with wide characters at ...

But, you ask, why would the plugin see already decoded content?

This can happen if the body was parsed with, for example, Expat or
XML::LibXML -- or there's some other custom body parser that needs to
work in decoded characters before this plugin runs.

So, it's very possible that the plugin would see already decoded
content and thus should not attempt to decode again.


Unless someone does something stupid and forces the utf8 flag true,
the utf8 flag indicates that the content has been decoded. That's why
it makes perfect sense to test it before decoding to avoid the error.

True, the utf8 flag does not mean we have non-ascii or wide
characters.

$ perl -MEncode -wle \
'print utf8::is_utf8( Encode::decode_utf8( foo ))
? Is utf8\n
: Is not utf8\n'

Is utf8

But it does mean we don't have to call decode again.

Even if decoding ascii-only chars as utf8 did NOT set the utf8 flag
true then decoding as utf8 again is not a problem (thanks to ascii
mapping to utf8).

Testing is the utf8 flag before decoding is the fix.


Here's another thing this plugin does incorrectly.  The
current code is:

sub prepare_parameters {
my $c = shift;

$c-NEXT::prepare_parameters;

for my $value ( values %{ $c-request-{parameters} } ) {

[...]


It should be decoding *body_parameters*.

In other words, the result of that code above is:

length $c-req-parameters-{foo}
!= length $c-req-body_parameters-{foo}.

(or one parameter is utf8-flagged characters and one is a byte
string).


So, the fix is to override prepare_body_parameters to decode the
body params then $c-engine-prepare_parameter will merge the decoded
body params with the query params:


sub prepare_body_parameters {
my $c = shift;

# Read in the body params via the engine
$c-NEXT::prepare_body_parameters;

for my $value ( values %{ $c-request-body_parameters } ) {

# This breaks decoding if a param is a hash
# and Data::Visitor will return hash *keys* so that's no help
next if ref $value  ref $value ne 'ARRAY';

for ( ref( $value ) ? @{$value} : $value ) {

# Don't decode if already decoded.
next if Encode::is_utf8( $_ );

$_ = decode_utf8( $_, $CHECK );
}
}

return;
}

Now, the query parameters need fixing in the same way.

Where the code to decode body and query parameters should go is
debatable.  In some ways I think it should be in the Engine as that's
what is bringing the data into the application.  

Maybe the decoding could happen in Engine's prepare_parameters as it's
looking at both the body and query params -- but that runs the risk of
something using the query params before they are decoded.


Finally, the plugin does this:

unless ( $c-response-content_type =~ /^text|xml$|javascript$/ ) {
return $c-NEXT::finalize;
}

unless ( Encode::is_utf8( $c-response-body ) ) {
return $c-NEXT::finalize;
}

$c-response-body( $c-encoding-encode( $c-response-body, $CHECK ) );

That's a bit less clear if that's correct or not.  The is_utf8 does
serve a purpose, but I think it's implemented wrong.

Here's an example where that might be a fix: Your normal encoding is
utf8, but you also return XML and you have a client that, for some
reason, insists on a different encoding.  So you encode the body
(clearing the utf8 flag) before the plugin sees the body.  So the test
of is_utf8 above will keep from encoding the already-encoded byte
string.

Of course, what should happen is the output encoding should be
determined by looking at the Accept-Charset and maybe send a 406 if
can't find a suitable encoding.


-- 
Bill Moseley
[EMAIL PROTECTED]


___
List:

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

2008-03-18 Thread Jonathan Rockway

* On Tue, Mar 18 2008, Bill Moseley wrote:
 The plugin decodes all parameters using:

 $_ = $c-encoding-decode( $_, $CHECK ) for ( ref($value) ? @{$value} : 
 $value );


 I'd think it would be wise to check to see if the string is already
 decoded:

 for ( ref($value) ? @{$value} : $value ) {
 next if Encode::is_utf8($_);
 $_ = $c-encoding-decode( $_, $CHECK );
 }

Never check is_utf8.  Encode will do the right thing regardless of
the internal representation of the string.

  http://blog.jrock.us/articles/Fuck%20the%20internal%20representation.pod

Regards,
Jonathan Rockway

-- 
print just = another = perl = hacker = if $,=$

___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

2008-03-18 Thread Bill Moseley

On Tue, Mar 18, 2008 at 03:38:06AM -0500, Jonathan Rockway wrote:
 * On Tue, Mar 18 2008, Bill Moseley wrote:
  The plugin decodes all parameters using:
 
  $_ = $c-encoding-decode( $_, $CHECK ) for ( ref($value) ? @{$value} : 
  $value );
 
 
  I'd think it would be wise to check to see if the string is already
  decoded:
 
  for ( ref($value) ? @{$value} : $value ) {
  next if Encode::is_utf8($_);
  $_ = $c-encoding-decode( $_, $CHECK );
  }
 
 Never check is_utf8.  Encode will do the right thing regardless of
 the internal representation of the string.


This is decoding, not encoding.

Without testing is_utf8 and if the parameters are already decoded I 
rightly get:

[error] Caught exception in engine Cannot decode string with wide
characters at
/usr/local/share/perl/5.8.8/Catalyst/Plugin/Unicode/Encoding.pm
line 74.


   http://blog.jrock.us/articles/Fuck%20the%20internal%20representation.pod

Yes, checking is_ut8 before encoding doesn't make sense.


-- 
Bill Moseley
[EMAIL PROTECTED]


___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

2008-03-18 Thread Jonas Alves

On Tue, Mar 18, 2008 at 11:01 AM, Bill Moseley [EMAIL PROTECTED] wrote:
 On Tue, Mar 18, 2008 at 03:38:06AM -0500, Jonathan Rockway wrote:
   * On Tue, Mar 18 2008, Bill Moseley wrote:
The plugin decodes all parameters using:
   
$_ = $c-encoding-decode( $_, $CHECK ) for ( ref($value) ? 
 @{$value} : $value );
   
   
I'd think it would be wise to check to see if the string is already
decoded:
   
for ( ref($value) ? @{$value} : $value ) {
next if Encode::is_utf8($_);
$_ = $c-encoding-decode( $_, $CHECK );
}
  
   Never check is_utf8.  Encode will do the right thing regardless of
   the internal representation of the string.
  

  This is decoding, not encoding.

  Without testing is_utf8 and if the parameters are already decoded I
  rightly get:

 [error] Caught exception in engine Cannot decode string with wide
 characters at
 /usr/local/share/perl/5.8.8/Catalyst/Plugin/Unicode/Encoding.pm
 line 74.



 http://blog.jrock.us/articles/Fuck%20the%20internal%20representation.pod

  Yes, checking is_ut8 before encoding doesn't make sense.



No, you never should check the flag. See Miyagawa's post
http://use.perl.org/~miyagawa/journal/35700.

-- 
Jonas

___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

2008-03-18 Thread Bill Moseley

On Wed, Mar 19, 2008 at 12:04:37AM +, Jonas Alves wrote:
 
 
 No, you never should check the flag. See Miyagawa's post
 http://use.perl.org/~miyagawa/journal/35700.

You quoted the entire message -- what point where you commenting on.

-- 
Bill Moseley
[EMAIL PROTECTED]


___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

2008-03-18 Thread Jonathan Rockway

* On Tue, Mar 18 2008, Bill Moseley wrote:

 Without testing is_utf8 and if the parameters are already decoded I 
 rightly get:

 [error] Caught exception in engine Cannot decode string with wide
 characters at
 /usr/local/share/perl/5.8.8/Catalyst/Plugin/Unicode/Encoding.pm
 line 74.

What this means is we don't know what the fuck the encoding is in the
incoming data, so just try random shit until something appears to
work.  I would prefer a more thought-out approach.

We should not need to check the flag.  The incoming data should be
encoded.  Then we should decode it.  Then we should not try to decode it
again.

Regards,
Jonathan Rockway

-- 
print just = another = perl = hacker = if $,=$

___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

2008-03-18 Thread Jonathan Rockway

* On Wed, Mar 19 2008, Jonathan Rockway wrote:

 We should not need to check the flag.  The incoming data should be
 encoded.  Then we should decode it.  Then we should not try to decode it
 again.

A key thing I forgot to mention is that is_utf8 doesn't mean we tried
to decode this already.  It means that the internal representation of
the string is utf8.

Regards,
Jonathan Rockway

-- 
print just = another = perl = hacker = if $,=$

___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/

[Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

2008-03-17 Thread Bill Moseley

The plugin decodes all parameters using:

$_ = $c-encoding-decode( $_, $CHECK ) for ( ref($value) ? @{$value} : 
$value );


I'd think it would be wise to check to see if the string is already
decoded:

for ( ref($value) ? @{$value} : $value ) {
next if Encode::is_utf8($_);
$_ = $c-encoding-decode( $_, $CHECK );
}

See any reason not to do that?

-- 
Bill Moseley
[EMAIL PROTECTED]


___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

Re: [Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

[Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

8 matches

Site Navigation

Mail list logo

Footer information