Christian and Porters,
Thanks for your report.
On Sep 15, 2005, at 18:53 , Christian Jaeger (via RT) wrote:
- the file being written to disk *does* contain utf8 sequences.
- the flag being written to disk is false. (Encode::is_utf8 gave
false)
- the length being written into the header is too short (which
means that the length builtin reported the length in unicode code
points, not bytes -- how can this be if Encode::is_utf8 is false?).
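For reference, the gap between the two kinds of length is easy to see apart from the bug. This is only a minimal sketch: the three-character sample string and the variable names are my own illustration, not taken from the reporter's code.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical payload: U+2020..U+2022, three characters,
# each of which takes three bytes when encoded as UTF-8.
my $chars = join '', map { chr } 0x2020 .. 0x2022;

my $char_len = length $chars;                # counts code points
my $bytes    = Encode::encode_utf8($chars);  # octets, flag off
my $byte_len = length $bytes;                # counts bytes

printf "characters = %d, bytes = %d\n", $char_len, $byte_len;
# prints "characters = 3, bytes = 9"
```

A header that must carry a byte count would need the second number, so a length taken from the character string comes out too short.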
I could not reproduce the symptom on perl 5.8.7, but I could on 5.8.6.
# utf8flag.pl
use strict;
use Encode;

# Version banner shown at the top of each run
print "Perl Version is $], Encode Version is $Encode::VERSION\n";

my $fn = 'test.txt';

# Write $str through the :utf8 layer, then read the raw bytes back.
sub readwrite {
    my $str = shift;
    open my $fh, ">:utf8", $fn or die "$fn : $!";
    print $fh $str;
    close $fh;
    open $fh, "<:raw", $fn or die "$fn : $!";   # reuse $fh, no second "my"
    read $fh, my $buf, -s $fn;
    close $fh;
    unlink $fn;
    return $buf;
}

# Report the UTF-8 flag as seen by Encode and by the utf8:: builtin.
sub checkstr {
    my $str = shift;
    print "Encode::is_utf8(\$str) = ", Encode::is_utf8($str), "\n";
    print "utf8::is_utf8(\$str) = ", utf8::is_utf8($str), "\n";
}

my $ascii = join '', map { chr $_ } 0x20..0x7e;     # only ascii
my $utf8  = join '', map { chr $_ } 0x2020..0x207e; # now Unicode;
checkstr(decode_utf8(readwrite $ascii));
checkstr(decode_utf8(readwrite $utf8));
__END__
Run the code as follows (this is on my Mac OS X v10.4.2):
% /usr/bin/perl utf8flag.pl
Perl Version is 5.008006, Encode Version is 2.08
Encode::is_utf8($str) =
utf8::is_utf8($str) =
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
% /usr/bin/perl -T utf8flag.pl
Perl Version is 5.008006, Encode Version is 2.08
Encode::is_utf8($str) =
utf8::is_utf8($str) =
Encode::is_utf8($str) =
utf8::is_utf8($str) = 1
% perl utf8flag.pl
Perl Version is 5.008007, Encode Version is 2.10
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
% perl -T utf8flag.pl
Perl Version is 5.008007, Encode Version is 2.10
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
As you see, on 5.8.6 utf8::is_utf8() works fine while Encode::is_utf8()
does not: under -T it reports false for the Unicode string. Also note
that on 5.8.7 the flag is set UNCONDITIONALLY, whether or not the
string contains U+100 and above.
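To illustrate why a flag set on an ASCII-only string is harmless, here is a small sketch. It uses utf8::upgrade() to turn the flag on by hand, which is my own stand-in and not necessarily the path decode_utf8() takes on 5.8.7; the variable names are hypothetical too.

```perl
use strict;
use warnings;

my $plain    = "hello";     # pure ASCII literal, flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);   # same characters, flag turned on

my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
my $same          = $plain eq $upgraded      ? "yes" : "no";

print "plain:    flag=$plain_flag\n";     # prints "plain:    flag=0"
print "upgraded: flag=$upgraded_flag\n";  # prints "upgraded: flag=1"
print "equal:    $same\n";                # prints "equal:    yes"
```

The two strings compare equal and have the same length; only the internal flag differs.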
/* universal.c */
XS(XS_utf8_is_utf8)
{
    dXSARGS;
    if (items != 1)
        Perl_croak(aTHX_ "Usage: utf8::is_utf8(sv)");
    {
        SV * sv = ST(0);
        {
            if (SvUTF8(sv))
                XSRETURN_YES;
            else
                XSRETURN_NO;
        }
    }
    XSRETURN_EMPTY;
}
/* end of code */
/* ext/Encode/Encode.xs */
bool
is_utf8(sv, check = 0)
SV *    sv
int     check
CODE:
{
    if (SvGMAGICAL(sv)) /* it could be $1, for example */
        sv = newSVsv(sv); /* GMAGIG will be done */
    if (SvPOK(sv)) {
        RETVAL = SvUTF8(sv) ? TRUE : FALSE;
        if (RETVAL &&
            check  &&
            !is_utf8_string((U8*)SvPVX(sv), SvCUR(sv)))
            RETVAL = FALSE;
    } else {
        RETVAL = FALSE;
    }
    if (sv != ST(0))
        SvREFCNT_dec(sv); /* it was a temp copy */
}
OUTPUT:
    RETVAL
/* end of code */
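The one thing Encode::is_utf8() does beyond utf8::is_utf8() is the optional check argument, which also validates the buffer with is_utf8_string(). A rough sketch of the difference from the Perl side follows; the malformed sample bytes and variable names are my own, and it forces the flag on with Encode::_utf8_on() purely for demonstration.

```perl
use strict;
use warnings;
use Encode ();

my $bogus = "\xff\xfe";     # two bytes that are not valid UTF-8
Encode::_utf8_on($bogus);   # force the flag on without validating

my $flag_only = Encode::is_utf8($bogus)    ? 1 : 0;  # looks at the flag only
my $checked   = Encode::is_utf8($bogus, 1) ? 1 : 0;  # flag plus validity check

print "flag only = $flag_only, with check = $checked\n";
# prints "flag only = 1, with check = 0"
```

Without the check argument, the answer depends on the flag alone.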
Though not harmful, the behavior of 5.8.7 does not match what Encode
documents. Should I fix the pod accordingly, or did this just reveal an
undocumented bug?
Dan the Encode Maintainer