The Encode manpage says this about FB_QUIET:

|  CHECK = Encode::FB_QUIET
|
|    If CHECK is set to Encode::FB_QUIET, (en|de)code will
|    immediately return the portion of the data that has been
|    processed so far when an error occurs. The data argument will
|    be overwritten with everything after that point (that is, the
|    unprocessed part of data).  This is handy when you have to
|    call decode repeatedly in the case where your source data may
|    contain partial multi-byte character sequences, for example
|    because you are reading with a fixed-width buffer. Here is
|    some sample code that does exactly this:
|
|      my $data = ''; my $utf8 = '';
|      while(defined(read $fh, $buffer, 256)){
|        # buffer may end in a partial character so we append
|        $data .= $buffer;
|        $utf8 .= decode($encoding, $data, Encode::FB_QUIET);
|        # $data now contains the unprocessed partial character
|      }

First off, this sample code is no good: the loop will normally never
terminate, because read() returns undef only on failure, and EOF is not
a failure.  Second, as soon as we encounter a bad byte in the stream we
end up accumulating the rest of the file in $data.  We need to
distinguish between bad stuff and incomplete sequences.  Also note that
an incomplete sequence at EOF is bad stuff.

I believe this function will do the right thing:

    use Encode;

    sub read_utf8 {
        my($fh, $bad_byte_cb) = @_;
        my $str = "";  # where we accumulate the result
        my $buf = "";
        my $n;
        do {
            $n = read($fh, $buf, 16, length($buf));
            die "Can't read: $!" unless defined $n;
            while (length $buf) {
                $str .= Encode::decode("UTF-8", $buf, Encode::FB_QUIET);
                last if $n && length($buf) < 4;  # possibly an incomplete char
                if (length($buf)) {
                    my $bad_byte = substr($buf, 0, 1, "");
                    $str .= &$bad_byte_cb(ord($bad_byte)) if $bad_byte_cb;
                }
            }
        } while $n;
        return $str;
    }

    # test it
    use Data::Dump;
    print Data::Dump::dump(read_utf8(*STDIN, sub { sprintf "%%%02X", shift })), "\n";

so I suggest adding this as an example to the documentation.

What I don't like here is the test for an incomplete char.  What I
really want is for Encode::decode() to tell me what the situation is,
so I want to extend its API.  The simplest way seems to be to just add
another argument that is updated to reflect this status:

    Encode::decode("UTF-8", $buf, Encode::FB_QUIET, $incomplete);

where $incomplete will be TRUE iff there is stuff left in $buf and the
reason is that more data is needed to decode properly.

Is this an acceptable extension?

Regards,
Gisle
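
P.S. To make the intended meaning of $incomplete concrete, here is a rough pure-Perl sketch of the check I have in mind. The name looks_incomplete() is made up for illustration and is not part of Encode. It assumes the encoding is strict UTF-8 and that $bytes is whatever decode() left behind in the buffer. It returns true only when those bytes are a proper prefix of a well-formed sequence, so that more input could still complete the character.

```perl
use strict;
use warnings;

# Hypothetical helper (not an Encode API): true if $bytes is a proper
# prefix of a well-formed UTF-8 sequence, i.e. more data might complete it.
sub looks_incomplete {
    my ($bytes) = @_;
    return 0 unless length $bytes;

    # Sequence length implied by the lead byte; 0 means ASCII, a stray
    # continuation byte, or a byte that can never start a sequence.
    my $lead = ord substr($bytes, 0, 1);
    my $need = $lead >= 0xC2 && $lead <= 0xDF ? 2
             : $lead >= 0xE0 && $lead <= 0xEF ? 3
             : $lead >= 0xF0 && $lead <= 0xF4 ? 4
             :                                  0;
    return 0 if !$need || length($bytes) >= $need;

    # The second byte has a narrower range after some lead bytes
    # (this rules out overlongs, surrogates and code points > U+10FFFF).
    if (length($bytes) >= 2) {
        my $b2 = ord substr($bytes, 1, 1);
        my ($lo, $hi) = $lead == 0xE0 ? (0xA0, 0xBF)
                      : $lead == 0xED ? (0x80, 0x9F)
                      : $lead == 0xF0 ? (0x90, 0xBF)
                      : $lead == 0xF4 ? (0x80, 0x8F)
                      :                 (0x80, 0xBF);
        return 0 if $b2 < $lo || $b2 > $hi;
    }

    # Everything after the lead byte must be a continuation byte.
    return substr($bytes, 1) =~ /\A[\x80-\xBF]*\z/ ? 1 : 0;
}
```

With something like this, the length test in read_utf8() could become `last if $n && looks_incomplete($buf);`. I would still prefer decode() to report the condition itself, since it already knows why it stopped and the same question comes up for every multi-byte encoding, not just UTF-8.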