Re: Output filters, data encoding

tomcat/perl Wed, 13 Nov 2019 13:41:53 -0800

On 13.11.2019 19:37, Damyan Ivanov wrote:

-=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-

        while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
..


and then I need to pass this data to another module for processing 
(Template::Toolkit).
To make a long story short, Template::Toolkit misinterprets the data I'm
sending to it, because this data /is/ actually UTF-8, but apparently not
marked so internally by the $f->read(). So TT2 re-encodes it, leading to
double UTF-8 encoding.

My question is : can I - and how -, set the filehandle that corresponds to
the $f->read(), to a UTF-8 layer ?
I have tried

line 155: binmode($f,'encoding:(UTF-8)');

and that triggers an error :
  Not a GLOB reference at (my filter) line 155.\n
)

Or do I need to read the data 'as is', and separately do an

  $decoded_buffer = decode('UTF-8', $buffer);


There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:

        If CHECK is set to "Encode::FB_QUIET", encoding and decoding
        immediately return the portion of the data that has been processed so
        far when an error occurs. The data argument is overwritten with
        everything after that point; that is, the unprocessed portion of the
        data.  This is handy when you have to call "decode" repeatedly in the
        case where your source data may contain partial multi-byte character
        sequences, (that is, you are reading with a fixed-width buffer). Here's
        some sample code to do exactly that:

            my($buffer, $string) = ("", "");
            while (read($fh, $buffer, 256, length($buffer))) {
                $string .= decode($encoding, $buffer, Encode::FB_QUIET);
                # $buffer now contains the unprocessed partial character
            }

Looks exactly like your case.

Thanks for the response and the tip.

My idea of adding a UTF-8 layer to the filehandle through which Apache2::Filter reads the incoming data was probably wrong anyway : it cannot do that, because it gets this data originally in chunks, as "bucket brigades" from Apache httpd. And there is no guarantee that such a bucket brigade would always end in "complete" UTF-8 character sequences.

At the very least, this would probably complicate the code underlying 
$f->read() quite a bit.
It is clearer to handle that in the filter itself.

The Encode::FB_QUIET flag above, with the incremental buffer read, is really 
smart.

Unfortunately, the Apache2::Filter read() method does not allow as many arguments, and all one has is something like this :


        my $accumulated_content = "";
        while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
                $accumulated_content .= $buffer;
        }
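
For the case where the content cannot be accumulated in full, the FB_QUIET technique might still be adapted to a two-argument read by appending each chunk to a leftover buffer by hand. The following is only a sketch, runnable outside mod_perl: the @chunks list is a stand-in for what successive $f->read() calls would return, and the names $pending and $text are purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the filter input (assumption): UTF-8 bytes cut every 3 bytes,
# so that some multi-byte characters are split across two chunks.
my $original = "na\x{ef}ve caf\x{e9} \x{20ac}5";
my @chunks   = unpack('(a3)*', encode('UTF-8', $original));

my ($pending, $text) = ('', '');
for my $chunk (@chunks) {
    # Put the new bytes behind whatever was left undecoded last time.
    $pending .= $chunk;
    # decode() returns the part it could process and overwrites $pending
    # with the unprocessed remainder (e.g. a partial character).
    $text .= decode('UTF-8', $pending, Encode::FB_QUIET);
}

# Anything still in $pending at the end is a truncated or invalid sequence.
warn "undecodable trailing bytes\n" if length $pending;
```

Note that a genuinely malformed byte in the middle of the stream would stay at the front of $pending and block all further decoding, so a real filter would need to decide how to handle that case.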

Luckily, in this case, I have to accumulate the complete response content anyway, before I can decide to call Template::Toolkit on it or not. So I can do a single decode() on $accumulated_content. Not the most efficient memory-wise, but good enough in this case.
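
A minimal sketch of that accumulate-then-decode approach, again runnable outside mod_perl. The @chunks list is an assumed stand-in for the data returned by $f->read(), and the use of FB_CROAK to reject invalid input is a choice made for the illustration, not something the filter requires. The last step shows what re-encoding the undecoded bytes produces, which is presumably the double encoding described earlier.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the chunks delivered by $f->read() (assumption).
my $original = "Gr\x{fc}\x{df}e, Andr\x{e9}";
my @chunks   = unpack('(a4)*', encode('UTF-8', $original));

# Accumulate the raw bytes first; chunk boundaries no longer matter.
my $accumulated_content = '';
$accumulated_content .= $_ for @chunks;

# One decode over the complete content. FB_CROAK makes invalid UTF-8 fatal,
# LEAVE_SRC keeps $accumulated_content untouched.
my $decoded = eval {
    decode('UTF-8', $accumulated_content,
           Encode::FB_CROAK | Encode::LEAVE_SRC);
};
die "content is not valid UTF-8: $@" unless defined $decoded;

# For comparison: encoding the undecoded bytes treats every byte as a
# Latin-1 character, which yields doubly encoded UTF-8.
my $double = encode('UTF-8', $accumulated_content);

printf "characters: %d, bytes: %d, doubly encoded bytes: %d\n",
    length($decoded), length($accumulated_content), length($double);
```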
