Re: Need better chop examples.

Benjamin Franz Wed, 14 Feb 2001 10:24:24 -0800
On Sun, 28 Jan 2001 [EMAIL PROTECTED] wrote:
> 
> I don't think there are any good examples. If there were any good
> examples, it would mean chop would be a useful function to have.
> But after the arrival of chomp, the only reason to keep chop is backwards
> compatibility.

Oddly, I had reason to use chop this morning while answering someone who
asked how to detect valid UTF8 strings.

The code that resulted was:

#!/usr/bin/perl -w

# See <URL:http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html>
# table 3.1B for a table of legal and illegal codes

# Should return true for legal sequences, false for
# sequences that cannot be legal UTF8

# Not tested....

sub legal_as_utf8_byte_sequence {
    my ($data) = @_;

    # Short circuit for speed in some common cases
    return if (not defined $data);
    return 1 if ($data eq '');
    return 1 if ($data =~ m/^([\x00-\x7F]*)$/s);

    $data = reverse $data;
    while ($data ne '') {

        # One byte codes
        my $byte1 = chop $data;
        next if ($byte1 =~ m/^[\x00-\x7F]$/);

        # Is it a possible multi-byte code?
        return if ($byte1 !~ m/^[\xC2-\xF4]$/);

        # Two byte codes.
        my $byte2 = chop $data;
        return if (not defined $byte2);
        return if ($byte2 !~ m/^[\x80-\xBF]$/);
        next if ($byte1 =~ m/^[\xC2-\xDF]$/);

        # Three byte codes.
        my $byte3 = chop $data;
        return if (not defined $byte3);
        return if ($byte3 !~ m/^[\x80-\xBF]$/);
        # E0 restricts the second byte to A0..BF; check that before accepting
        return if (($byte1 eq "\xE0") && ($byte2 !~ m/^[\xA0-\xBF]$/));
        next if ($byte1 =~ m/^[\xE0-\xEF]$/);

        # Four byte codes
        my $byte4 = chop $data;
        return if (not defined $byte4);
        return if ($byte4 !~ m/^[\x80-\xBF]$/);
        return if ($byte1 !~ m/^[\xF0-\xF4]$/);
        return if (($byte1 eq "\xF0") && ($byte2 !~ m/^[\x90-\xBF]$/));
        return if (($byte1 eq "\xF4") && ($byte2 !~ m/^[\x80-\x8F]$/));
    }

    # If we make it here, it could be a legal UTF8 byte sequence
    return 1;
}
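
The same table can also be written as a single regex, which may make the
byte ranges easier to check by eye. This is only a sketch and not part of
the code above: the name looks_like_utf8 is made up for illustration, and
the ranges are my reading of table 3.1B, so verify them against the table
before relying on it.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): returns 1 if $data
# consists only of byte sequences allowed by the ranges below, 0 otherwise,
# and an empty return for undef input.
sub looks_like_utf8 {
    my ($data) = @_;
    return if (not defined $data);
    return $data =~ m/\A(?:
          [\x00-\x7F]                       # one byte
        | [\xC2-\xDF][\x80-\xBF]            # two bytes
        | \xE0[\xA0-\xBF][\x80-\xBF]        # three bytes, E0 lead
        | [\xE1-\xEF][\x80-\xBF]{2}         # three bytes, other leads
        | \xF0[\x90-\xBF][\x80-\xBF]{2}     # four bytes, F0 lead
        | [\xF1-\xF3][\x80-\xBF]{3}         # four bytes, F1 to F3 leads
        | \xF4[\x80-\x8F][\x80-\xBF]{2}     # four bytes, F4 lead
    )*\z/x ? 1 : 0;
}

# A well-formed two byte sequence, then an overlong form with an illegal lead
print looks_like_utf8("caf\xC3\xA9") ? "legal\n" : "illegal\n";
print looks_like_utf8("\xC0\xAF")    ? "legal\n" : "illegal\n";
```

Like the chop loop, this accepts the range ED A0..BF (encoded surrogates),
which current versions of the Unicode standard treat as ill-formed; add a
separate ED alternative if that matters for your data.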


-- 
Benjamin Franz

... with proper design, the features come cheaply. This 
approach is arduous, but continues to succeed.

                                     ---Dennis Ritchie