On Sun, 28 Jan 2001 [EMAIL PROTECTED] wrote:
>
> I don't think there are any good examples. If there were any good
> examples, it would mean chop would be a useful function to have.
> But after the arrival of chomp, the only reason to keep chop is backwards
> compatibility.
Oddly enough, I had reason to use chop this morning while answering
someone's question about how to detect valid UTF-8 strings.
The code that resulted was:
#!/usr/bin/perl -w
# See <URL:http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html>
# table 3.1B for a table of legal and illegal codes
# Should return true for legal sequences, false for
# sequences that cannot be legal UTF8
# Not tested....
sub legal_as_utf8_byte_sequence {
    my ($data) = @_;

    # Short circuit for speed in some common cases
    return if (not defined $data);
    return 1 if ($data eq '');
    return 1 if ($data =~ m/^[\x00-\x7F]*$/s);

    # Reverse the string so that chop, which removes the last
    # character, hands the bytes back in their original order
    $data = reverse $data;

    while ($data ne '') {
        # One byte codes
        my $byte1 = chop $data;
        next if ($byte1 =~ m/^[\x00-\x7F]$/);

        # Is it a possible multi-byte code?
        return if ($byte1 !~ m/^[\xC2-\xF4]$/);

        # Two byte codes.
        my $byte2 = chop $data;
        return if (not defined $byte2);
        return if ($byte2 !~ m/^[\x80-\xBF]$/);
        next if ($byte1 =~ m/^[\xC2-\xDF]$/);

        # Three byte codes.
        my $byte3 = chop $data;
        return if (not defined $byte3);
        return if ($byte3 !~ m/^[\x80-\xBF]$/);
        next if ($byte1 =~ m/^[\xE1-\xEF]$/);
        return if (($byte1 eq "\xE0") && ($byte2 !~ m/^[\xA0-\xBF]$/));
        # A legal E0 sequence is complete after three bytes
        next if ($byte1 eq "\xE0");

        # Four byte codes
        my $byte4 = chop $data;
        return if (not defined $byte4);
        return if ($byte4 !~ m/^[\x80-\xBF]$/);
        return if ($byte1 !~ m/^[\xF0-\xF4]$/);
        return if (($byte1 eq "\xF0") && ($byte2 !~ m/^[\x90-\xBF]$/));
        # After F4 the second byte must lie in 80..8F
        return if (($byte1 eq "\xF4") && ($byte2 !~ m/^[\x80-\x8F]$/));
    }

    # If we make it here, it could be a legal UTF8 byte sequence
    return 1;
}
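
In case the reverse looks odd: chop only ever removes the last character
of a string, so the function flips the string once up front and then chops
its way through the bytes in their original order. Here is a minimal sketch
of just that idiom. The name bytes_front_to_back is made up for
illustration and is not part of the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: returns the characters
# of $data front to back, using nothing but reverse and chop.
sub bytes_front_to_back {
    my ($data) = @_;
    my @bytes;

    # chop removes from the end, so flip the string first
    $data = reverse $data;
    while ($data ne '') {
        push @bytes, chop $data;
    }
    return @bytes;
}

# C3 A9 is a two byte sequence, followed by a plain ASCII "A"
my @bytes = bytes_front_to_back("\xC3\xA9" . "A");
print join(' ', map { sprintf('%02X', ord($_)) } @bytes), "\n";
```

The validator uses the same loop, except that it chops one to four bytes
per pass depending on what the lead byte turns out to be.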
--
Benjamin Franz
... with proper design, the features come cheaply. This
approach is arduous, but continues to succeed.
---Dennis Ritchie