Christian and Porters,

Thanks for your report.

On Sep 15, 2005, at 18:53 , Christian Jaeger (via RT) wrote:
 - the file being written to disk *does* contain utf8 sequences.
 - the flag being written to disk is false. (Encode::is_utf8 gave false)
 - the length being written into the header is too short (which
   means that the length builtin reported the length in unicode code
   points, not bytes -- how can this be if Encode::is_utf8 is false?).
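
For reference, the gap between the two lengths mentioned in the report can be sketched as follows. This is only an illustration, not Christian's code; the sample string is my own assumption.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical sample: one ASCII letter plus U+2020 (DAGGER).
my $chars = "a\x{2020}";
my $bytes = encode_utf8($chars);   # UTF-8 octets, UTF8 flag off

my $char_len = length $chars;      # counts code points
my $byte_len = length $bytes;      # counts octets

print "characters: $char_len\n";   # 2
print "bytes:      $byte_len\n";   # 4 (U+2020 takes three octets in UTF-8)
```

A header that stores length($chars) while the file holds $bytes would come up short in exactly this way.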

I could not reproduce the symptom on perl 5.8.7, but I could on 5.8.6 with the following script.

#
use strict;
use Encode;
# Report the versions in use (first line of each run shown below).
print "Perl Version is $], Encode Version is $Encode::VERSION\n";
my $fn = 'test.txt';
# Write $str through the :utf8 layer, then read the file back as raw bytes.
sub readwrite{
    my $str = shift;
    open my $out, ">:utf8", $fn or die "$fn : $!";
    print $out $str;
    close $out;
    open my $in, "<:raw", $fn or die "$fn : $!";
    read $in, my $buf, -s $fn;
    close $in; unlink $fn;
    return $buf;
}
# Print what each is_utf8 implementation reports for the UTF8 flag.
sub checkstr{
    my $str = shift;
    print "Encode::is_utf8(\$str) = ", Encode::is_utf8($str), "\n";
    print "utf8::is_utf8(\$str) = ",   utf8::is_utf8($str), "\n";
}
my $ascii = join '', map { chr $_ } 0x20..0x7e; # only ascii
my $utf8  = join '', map { chr $_ } 0x2020..0x207e; # now Unicode
checkstr(decode_utf8(readwrite $ascii));
checkstr(decode_utf8(readwrite $utf8));
__END__

Running the code (on my Mac OS X v10.4.2) gives the following:

% /usr/bin/perl utf8flag.pl
Perl Version is 5.008006, Encode Version is 2.08
Encode::is_utf8($str) =
utf8::is_utf8($str) =
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
% /usr/bin/perl -T utf8flag.pl
Perl Version is 5.008006, Encode Version is 2.08
Encode::is_utf8($str) =
utf8::is_utf8($str) =
Encode::is_utf8($str) =
utf8::is_utf8($str) = 1
% perl  utf8flag.pl
Perl Version is 5.008007, Encode Version is 2.10
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
% perl -T utf8flag.pl
Perl Version is 5.008007, Encode Version is 2.10
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1

As you can see, on 5.8.6 under -T, utf8::is_utf8() reports the flag correctly for the Unicode string while Encode::is_utf8() does not. Also note that on 5.8.7 the flag is set UNCONDITIONALLY, whether the string contains U+0100 and above or not.

/* universal.c */
XS(XS_utf8_is_utf8)
{
     dXSARGS;
     if (items != 1)
          Perl_croak(aTHX_ "Usage: utf8::is_utf8(sv)");
     {
          SV *  sv = ST(0);
          {
               if (SvUTF8(sv))
                    XSRETURN_YES;
               else
                    XSRETURN_NO;
          }
     }
     XSRETURN_EMPTY;
}
/* end of code */

/* ext/Encode/Encode.xs */
bool
is_utf8(sv, check = 0)
SV *    sv
int     check
CODE:
{
    if (SvGMAGICAL(sv)) /* it could be $1, for example */
        sv = newSVsv(sv); /* GMAGIG will be done */
    if (SvPOK(sv)) {
        RETVAL = SvUTF8(sv) ? TRUE : FALSE;
        if (RETVAL &&
            check  &&
            !is_utf8_string((U8*)SvPVX(sv), SvCUR(sv)))
            RETVAL = FALSE;
    } else {
        RETVAL = FALSE;
    }
    if (sv != ST(0))
        SvREFCNT_dec(sv); /* it was a temp copy */
}
OUTPUT:
    RETVAL

/* end of code */
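
To illustrate what the optional check argument in the XS code above does, here is a minimal sketch. The malformed value is a hypothetical one that I build with Encode::_utf8_on, the internal helper that turns the flag on without validating anything.

```perl
use strict;
use warnings;
use Encode ();

# Well-formed character string: flag on, valid UTF-8 inside.
my $good = "\x{2020}";

# Hypothetical malformed value: a lone continuation octet with the
# flag forced on. Do not do this in real code.
my $bad = "\x80";
Encode::_utf8_on($bad);

my $good_checked = Encode::is_utf8($good, 1) ? 1 : 0;  # flag on and well-formed
my $bad_flag     = Encode::is_utf8($bad)     ? 1 : 0;  # looks at the flag only
my $bad_checked  = Encode::is_utf8($bad, 1)  ? 1 : 0;  # also validates the octets

print "good, checked: $good_checked\n";
print "bad, flag only: $bad_flag\n";
print "bad, checked: $bad_checked\n";
```

Without check only SvUTF8 is consulted; with check the buffer is also run through is_utf8_string, so the malformed value is rejected.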

Though not harmful, the behavior of 5.8.7 does not match what the Encode documentation says. Should I fix the pod accordingly, or did this just reveal an undocumented bug?
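
By "not harmful" I mean something like the sketch below: for an ASCII-only string, the flag changes the internal representation but not what the string compares equal to. The sample value is hypothetical, and utf8::upgrade stands in for whatever sets the flag in decode_utf8.

```perl
use strict;
use warnings;

my $plain    = "hello";     # hypothetical ASCII-only sample, flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);   # same characters, UTF8 flag now on

my $flags_differ = (utf8::is_utf8($upgraded) && !utf8::is_utf8($plain)) ? 1 : 0;
my $same_string  = ($plain eq $upgraded) ? 1 : 0;
my $same_length  = (length($plain) == length($upgraded)) ? 1 : 0;

print "flags differ: $flags_differ\n";
print "eq:           $same_string\n";
print "same length:  $same_length\n";
```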

Dan the Encode Maintainer

