[perl.git] branch blead, updated. v5.23.2-13-g356ad74

Karl Williamson Thu, 20 Aug 2015 11:50:04 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/356ad745a622c0a0675fbc8f0abb7a0eb2e5ce51?hp=c9f3f183113edf79fb8e5aac6822d93d09d7b203>


- Log -----------------------------------------------------------------
commit 356ad745a622c0a0675fbc8f0abb7a0eb2e5ce51
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 20 11:52:51 2015 -0600

    mktables: Comment changes only

M       charclass_invlists.h
M       lib/unicore/mktables
M       regcharclass.h

commit f8b690041a9dddc2b00dd38a514deed7abc7aa3f
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 20 11:50:39 2015 -0600

    mktables: Move file handling to non-exceptional order
    
    The DAge.txt property until the previous commit had to be handled
    out-of-the-normal order.  This is no longer required.

M       charclass_invlists.h
M       lib/unicore/mktables
M       regcharclass.h

commit 8f21889c2a0e574a419f6f6dbb2090be2bf6ff8f
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 20 11:35:21 2015 -0600

    mktables: Revamp the compare versions functionality
    
    This functionality is rarely used, but enables someone to see what
    Unicode has changed between releases X and Y, without the clutter of the
    things that are added after X came out.  In other words it compiles
    release X using Y's rules.  To use it, you must go in and edit mktables
    to specify to use this; so it is intended only for a developer who wants
    to look at Unicode history.  One use I've done is to look at the beta
    version of a new release to compare with the previous official one.
    This allows me to find typos, and unintentional changes and report them
    back to Unicode.
    
    This commit significantly overhauls this feature, giving better results
    than before.

M       charclass_invlists.h
M       lib/unicore/mktables
M       regcharclass.h

commit 1254636bfb1e1850e3455fb1ac48a1211d12f341
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 20 11:03:47 2015 -0600

    mktables: Fix so -annotate works on early Unicodes
    
    There were several glitches when compiling very early Unicode releases.
    This commit changes things so the age property reference is stored in a
    global, and doesn't have to be refound multiple times.

M       charclass_invlists.h
M       lib/unicore/mktables
M       regcharclass.h

commit e47e66b9fe2b9c42aa165717831b2cb37353c36a
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 20 10:42:36 2015 -0600

    mktables: Move code to common functions
    
    This takes two code sections and moves them to a function each.  For
    one, this is in preparation for being used in a 2nd place.  For the
    other, call the code in existing other places.

M       charclass_invlists.h
M       lib/unicore/mktables
M       regcharclass.h

commit a39459e754d2246aece7c023c999ba1368e1edaf
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 20 10:48:36 2015 -0600

    mktables: Fix up property calc for early Unicodes
    
    The Default_Ignorable_Code_Point property is applicable to unassigned
    code points, so shouldn't restrict our calculated value to assigned.
    (We calculate what the property would be when run on Unicode releases
    that haven't defined it yet.)

M       charclass_invlists.h
M       lib/unicore/mktables
M       regcharclass.h

commit 53adf6a2e6002345048c5137b16445f8dbf4bb81
Author: Karl Williamson <[email protected]>
Date:   Thu Aug 20 10:20:21 2015 -0600

    mktables: Use mnemonic instead of hex constant
    
    These constants are used in more than one place.  Use a common variable
    instead of repeating the hex numbers

M       charclass_invlists.h
M       lib/unicore/mktables
M       regcharclass.h
-----------------------------------------------------------------------

Summary of changes:
 charclass_invlists.h |   2 +-
 lib/unicore/mktables | 611 +++++++++++++++++++++++++++++++++++++--------------
 regcharclass.h       |   2 +-
 3 files changed, 451 insertions(+), 164 deletions(-)

diff --git a/charclass_invlists.h b/charclass_invlists.h
index 8f14d31..91c38a3 100644
--- a/charclass_invlists.h
+++ b/charclass_invlists.h
@@ -99537,7 +99537,7 @@ static const UV XPosixXDigit_invlist[] = { /* for 
EBCDIC POSIX-BC */
  * 1a0687fb9c6c4567e853913549df0944fe40821279a3e9cdaa6ab8679bc286fd 
lib/unicore/extracted/DLineBreak.txt
  * 40bcfed3ca727c19e1331f6c33806231d5f7eeeabd2e6a9e06a3740c85d0c250 
lib/unicore/extracted/DNumType.txt
  * a18d502bad39d527ac5586d7bc93e29f565859e3bcc24ada627eff606d6f5fed 
lib/unicore/extracted/DNumValues.txt
- * ad739a46951b5f46396074b0682a2cfeed24b633a742a8e1aa0e337f69ef8b1c 
lib/unicore/mktables
+ * d5895407f73f1bcb0d7ad39e955c0e88a9145e401e868572849eeb134e669269 
lib/unicore/mktables
  * 462c9aaa608fb2014cd9649af1c5c009485c60b9c8b15b89401fdc10cf6161c6 
lib/unicore/version
  * c6884f4d629f04d1316f3476cb1050b6a1b98ca30c903262955d4eae337c6b1e 
regen/charset_translations.pl
  * 8a097f8f726bb1619af2f27f149ab87e60a1602f790147e3a561358be16abd27 
regen/mk_invlists.pl
diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index 8153936..c621923 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -615,15 +615,17 @@ our $to_trace = 0;
 # This is for a rarely used development feature that allows you to compare two
 # versions of the Unicode standard without having to deal with changes caused
 # by the code points introduced in the later version.  You probably also want
-# to use the -annotate option when using this.  Change the 0 to a string
-# containing a SINGLE dotted Unicode release number (e.g. "2.1").  Only code
-# points introduced in that release and earlier will be used; later ones are
-# thrown away.  You use the version number of the earliest one you want to
-# compare; then run this program on directory structures containing each
-# release, and compare the outputs.  These outputs will therefore include only
-# the code points common to both releases, and you can see the changes caused
-# just by the underlying release semantic changes.  For versions earlier than
-# 3.2, you must copy a version of DAge.txt into the directory.
+# to use the -annotate option when using this.  Run this program on a unicore
+# containing the starting release you want to compare.  Save that output
+# structrue.  Then, switching to a unicore with the ending release, change the
+# 0 in the $string_compare_versions definition just below to a string
+# containing a SINGLE dotted Unicode release number (e.g. "2.1") corresponding
+# to the starting release.  This program will then compile, but throw away all
+# code points introduced after the starting release.  Finally use a diff tool
+# to compare the two directory structures.  They include only the code points
+# common to both releases, and you can see the changes caused just by the
+# underlying release semantic changes.  For versions earlier than 3.2, you
+# must copy a version of DAge.txt into the directory.
 my $string_compare_versions = DEBUG && 0; #  e.g., "2.1";
 my $compare_versions = DEBUG
                        && $string_compare_versions
@@ -802,6 +804,11 @@ close $VERSION;
 chomp $string_version;
 my $v_version = pack "C*", split /\./, $string_version;        # v string
 
+my $unicode_version = ($compare_versions)
+                      ? (  "$string_compare_versions (using "
+                         . "$string_version rules)")
+                      : $string_version;
+
 # The following are the complete names of properties with property values that
 # are known to not match any code points in some versions of Unicode, but that
 # may change in the future so they should be matchable, hence an empty file is
@@ -1071,7 +1078,7 @@ my %default_mapping = (
 my $HEADER=<<"EOF";
 # !!!!!!!   DO NOT EDIT THIS FILE   !!!!!!!
 # This file is machine-generated by $0 from the Unicode
-# database, Version $string_version.  Any changes made here will be lost!
+# database, Version $unicode_version.  Any changes made here will be lost!
 EOF
 
 my $INTERNAL_ONLY_HEADER = <<"EOF";
@@ -1320,6 +1327,12 @@ my %ucd_pod;    # Holds entries that will go into the 
UCD section of the pod
 # unlikely that they will ever change.
 my %caseless_equivalent_to;
 
+# This is the range of characters that were in Release 1 of Unicode, and
+# removed in Release 2 (replaced with the current Hangul syllables starting at
+# U+AC00).  The range was reused starting in Release 3 for other purposes.
+my $FIRST_REMOVED_HANGUL_SYLLABLE = 0x3400;
+my $FINAL_REMOVED_HANGUL_SYLLABLE = 0x4DFF;
+
 # These constants names and values were taken from the Unicode standard,
 # version 5.1, section 3.12.  They are used in conjunction with Hangul
 # syllables.  The '_string' versions are so generated tables can retain the
@@ -1393,6 +1406,7 @@ my $MIN_FRACTION_LENGTH = 3; # How many digits of a 
floating point number at
 my $MAX_FLOATING_SLOP = 10 ** - $MIN_FRACTION_LENGTH; # And in floating terms
 
 # These store references to certain commonly used property objects
+my $age;
 my $ccc;
 my $gc;
 my $perl;
@@ -1401,6 +1415,8 @@ my $perl_charname;
 my $print;
 my $All;
 my $Assigned;   # All assigned characters in this Unicode release
+my $DI;         # Default_Ignorable_Code_Point property
+my $NChar;      # Noncharacter_Code_Point property
 my $script;
 
 # Are there conflicting names because of beginning with 'In_', or 'Is_'
@@ -1476,12 +1492,28 @@ sub populate_char_info ($) {
     Carp::carp_extra_args(\@_) if main::DEBUG && @_;
 
     $viacode[$i] = $perl_charname->value_of($i) || "";
+    $age[$i] = (defined $age)
+               ? (($age->value_of($i) =~ / ^ \d \. \d $ /x)
+                  ? $age->value_of($i)
+                  : "")
+               : "";
 
     # A character is generally printable if Unicode says it is,
     # but below we make sure that most Unicode general category 'C' types
     # aren't.
     $printable[$i] = $print->contains($i);
 
+    # But the characters in this range were removed in v2.0 and replaced by
+    # different ones later.  Modern fonts will be for the replacement
+    # characters, so suppress printing them.
+    if (($v_version lt v2.0
+         || ($compare_versions && $compare_versions lt v2.0))
+        && (   $i >= $FIRST_REMOVED_HANGUL_SYLLABLE
+            && $i <= $FINAL_REMOVED_HANGUL_SYLLABLE))
+    {
+        $printable[$i] = 0;
+    }
+
     $annotate_char_type[$i] = $perl_charname->type_of($i) || 0;
 
     # Only these two regular types are treated specially for annotations
@@ -1494,47 +1526,41 @@ sub populate_char_info ($) {
     # point of the range.
     my $end;
     if (! $viacode[$i]) {
-        my $nonchar;
         if ($i > $MAX_UNICODE_CODEPOINT) {
             $viacode[$i] = 'Above-Unicode';
             $annotate_char_type[$i] = $ABOVE_UNICODE_TYPE;
             $printable[$i] = 0;
             $end = $MAX_WORKING_CODEPOINT;
-            $age[$i] = "";
         }
         elsif ($gc-> table('Private_use')->contains($i)) {
             $viacode[$i] = 'Private Use';
             $annotate_char_type[$i] = $PRIVATE_USE_TYPE;
             $printable[$i] = 0;
             $end = $gc->table('Private_Use')->containing_range($i)->end;
-            $age[$i] = property_ref("Age")->value_of($i);
         }
-        elsif ((defined ($nonchar =
-                            Property::property_ref('Noncharacter_Code_Point'))
-               && $nonchar->table('Y')->contains($i)))
-        {
+        elsif ($NChar->contains($i)) {
             $viacode[$i] = 'Noncharacter';
             $annotate_char_type[$i] = $NONCHARACTER_TYPE;
             $printable[$i] = 0;
-            $end = property_ref('Noncharacter_Code_Point')->table('Y')->
-                                                    containing_range($i)->end;
-            $age[$i] = property_ref("Age")->value_of($i);
+            $end = $NChar->containing_range($i)->end;
         }
         elsif ($gc-> table('Control')->contains($i)) {
-            $viacode[$i] = property_ref('Name_Alias')->value_of($i) || 
'Control';
+            my $name_ref = property_ref('Name_Alias');
+            $name_ref = property_ref('Unicode_1_Name') if ! defined $name_ref;
+            $viacode[$i] = (defined $name_ref)
+                           ? $name_ref->value_of($i)
+                           : 'Control';
             $annotate_char_type[$i] = $CONTROL_TYPE;
             $printable[$i] = 0;
-            $age[$i] = property_ref("Age")->value_of($i);
         }
         elsif ($gc-> table('Unassigned')->contains($i)) {
             $annotate_char_type[$i] = $UNASSIGNED_TYPE;
             $printable[$i] = 0;
+            $viacode[$i] = 'Unassigned';
+
             if (defined $block) { # No blocks in earliest releases
-                $viacode[$i] = 'Unassigned';
+                $viacode[$i] .= ', block=' . $block-> value_of($i);
                 $end = $gc-> table('Unassigned')->containing_range($i)->end;
-            }
-            else {
-                $viacode[$i] = 'Unassigned, block=' . $block-> value_of($i);
 
                 # Because we name the unassigned by the blocks they are in, it
                 # can't go past the end of that block, and it also can't go
@@ -1545,14 +1571,19 @@ sub populate_char_info ($) {
                            $unassigned_sans_noncharacters->
                                                     containing_range($i)->end);
             }
-            $age[$i] = property_ref("Age")->value_of($i);
+            else {
+                $end = $i + 1;
+                while ($unassigned_sans_noncharacters->contains($end)) {
+                    $end++;
+                }
+                $end--;
+            }
         }
         elsif ($perl->table('_Perl_Surrogate')->contains($i)) {
             $viacode[$i] = 'Surrogate';
             $annotate_char_type[$i] = $SURROGATE_TYPE;
             $printable[$i] = 0;
             $end = $gc->table('Surrogate')->containing_range($i)->end;
-            $age[$i] = property_ref("Age")->value_of($i);
         }
         else {
             Carp::my_carp_bug("Can't figure out how to annotate "
@@ -1569,17 +1600,20 @@ sub populate_char_info ($) {
     elsif ($annotate_char_type[$i] == $CP_IN_NAME) {
         $viacode[$i] .= sprintf("-%04X", $i);
 
-        # Do all these as groups of the same age, instead of individually,
-        # because their names are so meaningless, and there are typically
-        # large quantities of them.
-        my $Age = property_ref("Age");
-        $age[$i] = $Age->value_of($i);
         my $limit = $perl_charname->containing_range($i)->end;
-        $end = $i + 1;
-        while ($end <= $limit && $Age->value_of($end) == $age[$i]) {
-            $end++;
+        if (defined $age) {
+            # Do all these as groups of the same age, instead of individually,
+            # because their names are so meaningless, and there are typically
+            # large quantities of them.
+            $end = $i + 1;
+            while ($end <= $limit && $age->value_of($end) == $age[$i]) {
+                $end++;
+            }
+            $end--;
+        }
+        else {
+            $end = $limit;
         }
-        $end--;
     }
 
     # And here, has a name, but if it's a hangul syllable one, replace it with
@@ -1592,12 +1626,8 @@ sub populate_char_info ($) {
         my $T = $TBase + $SIndex % $TCount;
         $viacode[$i] = "HANGUL SYLLABLE $Jamo{$L}$Jamo{$V}";
         $viacode[$i] .= $Jamo{$T} if $T != $TBase;
-        $age[$i] = property_ref("Age")->value_of($i);
         $end = $perl_charname->containing_range($i)->end;
     }
-    else {
-        $age[$i] = property_ref("Age")->value_of($i);
-    }
 
     return if ! defined wantarray;
     return $i if ! defined $end;    # If not a range, return the input
@@ -2602,7 +2632,7 @@ END
         # once per file, as it destroy's the EOF handlers
 
         # flag to make sure extracted files are processed early
-        state $seen_non_extracted_non_age = 0;
+        state $seen_non_extracted = 0;
 
         my $self = shift;
         Carp::carp_extra_args(\@_) if main::DEBUG && @_;
@@ -2615,7 +2645,7 @@ END
             $handle{$addr} = 'pretend_is_open';
         }
         else {
-            if ($seen_non_extracted_non_age) {
+            if ($seen_non_extracted) {
                 if ($file =~ /$EXTRACTED/i) # Some platforms may change the
                                             # case of the file's name
                 {
@@ -2632,13 +2662,12 @@ END
                     # We only do this check for generic property files
                     && $handler{$addr} == \&main::process_generic_property_file
 
-                    && $file !~ /$EXTRACTED/i
-                    && lc($file) ne 'dage.txt')
+                    && $file !~ /$EXTRACTED/i)
             {
                 # We don't set this (by the 'if' above) if we have no
                 # extracted directory, so if running on an early version,
                 # this test won't work.  Not worth worrying about.
-                $seen_non_extracted_non_age = 1;
+                $seen_non_extracted = 1;
             }
 
             # Mark the file as having being processed, and warn if it
@@ -3231,7 +3260,8 @@ END
         Carp::carp_extra_args(\@_) if main::DEBUG && @_;
 
         my $object = main::property_ref($property{$addr});
-        $object->add_map(0x3400, 0x4DFF,
+        $object->add_map($FIRST_REMOVED_HANGUL_SYLLABLE,
+                         $FINAL_REMOVED_HANGUL_SYLLABLE,
                          $early{$addr}[3],  # Passed-in value for these
                          Replace => $UNCONDITIONALLY);
     }
@@ -7123,7 +7153,7 @@ END
             else {
                 $cp = "one of the $code_points";
             }
-            $cp .= " in Unicode Version $string_version for which the mapping 
is not to $map_to";
+            $cp .= " in Unicode Version $unicode_version for which the mapping 
is not to $map_to";
         }
 
         my $comment = "";
@@ -8369,7 +8399,7 @@ resources, every table that matches the identical set of 
code points in this
 version of Unicode uses this file.  Each one is listed in a separate group
 below.  It could be that the tables will match the same set of code points in
 other Unicode releases, or it could be purely coincidence that they happen to
-be the same in Unicode $string_version, and hence may not in other versions.
+be the same in Unicode $unicode_version, and hence may not in other versions.
 
 END
         }
@@ -8394,7 +8424,7 @@ END
             Carp::my_carp("No regular expression construct can match $leader, 
as all names for it are the null string.  Creating file anyway.");
             $comment .= <<END;
 This file returns the $code_points in Unicode Version
-$string_version for
+$unicode_version for
 $leader, but it is inaccessible through Perl regular expressions, as
 "\\p{prop=}" is not recognized.
 END
@@ -8402,7 +8432,7 @@ END
         } else {
             $comment .= <<END;
 This file returns the $code_points in Unicode Version
-$string_version that
+$unicode_version that
 $match$synonyms:
 
 $matches_comment
@@ -9914,6 +9944,7 @@ sub finish_property_setup {
     $gc = property_ref('General_Category');
     $block = property_ref('Block');
     $script = property_ref('Script');
+    $age = property_ref('Age');
 
     # Perl adds this alias.
     $gc->add_alias('Category');
@@ -10344,7 +10375,6 @@ END
     # As noted in the comments early in the program, it generates tables for
     # the default values for all releases, even those for which the concept
     # didn't exist at the time.  Here we add those if missing.
-    my $age = property_ref('age');
     if (defined $age && ! defined $age->table('Unassigned')) {
         $age->add_match_table('Unassigned');
     }
@@ -10711,9 +10741,6 @@ sub output_perl_charnames_line ($$) {
 }
 
 { # Closure
-    # This is used to store the range list of all the code points usable when
-    # the little used $compare_versions feature is enabled.
-    my $compare_versions_range_list;
 
     # These are constants to the $property_info hash in this subroutine, to
     # avoid using a quoted-string which might have a typo.
@@ -10818,73 +10845,6 @@ sub output_perl_charnames_line ($$) {
             my $low = hex $1;
             my $high = (defined $2) ? hex $2 : $low;
 
-            # For the very specialized case of comparing two Unicode
-            # versions...
-            if (DEBUG && $compare_versions) {
-                if ($property_name eq 'Age') {
-
-                    # Only allow code points at least as old as the version
-                    # specified.
-                    my $age = pack "C*", split(/\./, $map);        # v string
-                    next LINE if $age gt $compare_versions;
-                }
-                else {
-
-                    # Again, we throw out code points younger than those of
-                    # the specified version.  By now, the Age property is
-                    # populated.  We use the intersection of each input range
-                    # with this property to find what code points in it are
-                    # valid.   To do the intersection, we have to convert the
-                    # Age property map to a Range_list.  We only have to do
-                    # this once.
-                    if (! defined $compare_versions_range_list) {
-                        my $age = property_ref('Age');
-                        if (! -e 'DAge.txt') {
-                            croak "Need to have 'DAge.txt' file to do version 
comparison";
-                        }
-                        elsif ($age->count == 0) {
-                            croak "The 'Age' table is empty, but its file 
exists";
-                        }
-                        $compare_versions_range_list
-                                        = Range_List->new(Initialize => $age);
-                    }
-
-                    # An undefined map is always 'Y'
-                    $map = 'Y' if ! defined $map;
-
-                    # Calculate the intersection of the input range with the
-                    # code points that are known in the specified version
-                    my @ranges = ($compare_versions_range_list
-                                  & Range->new($low, $high))->ranges;
-
-                    # If the intersection is empty, throw away this range
-                    next LINE unless @ranges;
-
-                    # Only examine the first range this time through the loop.
-                    my $this_range = shift @ranges;
-
-                    # Put any remaining ranges in the queue to be processed
-                    # later.  Note that there is unnecessary work here, as we
-                    # will do the intersection again for each of these ranges
-                    # during some future iteration of the LINE loop, but this
-                    # code is not used in production.  The later intersections
-                    # are guaranteed to not splinter, so this will not become
-                    # an infinite loop.
-                    my $line = join ';', $property_name, $map;
-                    foreach my $range (@ranges) {
-                        $file->insert_adjusted_lines(sprintf("%04X..%04X; %s",
-                                                            $range->start,
-                                                            $range->end,
-                                                            $line));
-                    }
-
-                    # And set things up so that the below will process this 
first
-                    # range, like any other.
-                    $low = $this_range->start;
-                    $high = $this_range->end;
-                }
-            } # End of $compare_versions
-
             # If changing to a new property, get the things constant per
             # property
             if ($previous_property_name ne $property_name) {
@@ -13059,7 +13019,9 @@ END
     # not being right at all.
     if ($v_version lt v2.0.0) {
         my $property = property_ref($file->property);
-        $file->insert_lines("3400..4DFF; LVT\n");
+        $file->insert_lines(sprintf("%04X..%04X; LVT\n",
+                                    $FIRST_REMOVED_HANGUL_SYLLABLE,
+                                    $FINAL_REMOVED_HANGUL_SYLLABLE));
         push @tables_that_may_be_empty, $property->table('LV')->complete_name;
         return;
     }
@@ -13290,9 +13252,9 @@ END
 
     # For each property, fill in any missing mappings, and calculate the re
     # match tables.  If a property has more than one missing mapping, the
-    # default is a reference to a data structure, and requires data from other
-    # properties to resolve.  The sort is used to cause these to be processed
-    # last, after all the other properties have been calculated.
+    # default is a reference to a data structure, and may require data from
+    # other properties to resolve.  The sort is used to cause these to be
+    # processed last, after all the other properties have been calculated.
     # (Fortunately, the missing properties so far don't depend on each other.)
     foreach my $property
         (sort { (defined $a->default_map && ref $a->default_map) ? 1 : -1 }
@@ -13602,7 +13564,7 @@ sub pre_3_dot_1_Nl () {
     return $Nl;
 }
 
-sub calculate_Assigned() {  # Calculate the gc != Cn code points; may be
+sub calculate_Assigned() {  # Set $Assigned to the gc != Cn code points; may be
                             # called before the Cn's are completely filled.
                             # Works on Unicodes earlier than ones that
                             # explicitly specify Cn.
@@ -13622,6 +13584,339 @@ sub calculate_Assigned() {  # Calculate the gc != Cn 
code points; may be
     }
 }
 
+sub calculate_DI() {    # Set $DI to a Range_List equivalent to the
+                        # Default_Ignorable_Code_Point property.  Works on
+                        # Unicodes earlier than ones that explicitly specify
+                        # DI.
+    return if defined $DI;
+
+    if (defined (my $di = property_ref('Default_Ignorable_Code_Point'))) {
+        $DI = $di->table('Y');
+    }
+    else {
+        $DI = Range_List->new(Initialize => [ 0x180B .. 0x180D,
+                                              0x2060 .. 0x206F,
+                                              0xFE00 .. 0xFE0F,
+                                              0xFFF0 .. 0xFFFB,
+                                            ]);
+        if ($v_version ge v2.0) {
+            $DI += $gc->table('Cf')
+                +  $gc->table('Cs');
+
+            # These are above the Unicode version 1 max
+            $DI->add_range(0xE0000, 0xE0FFF);
+        }
+        $DI += $gc->table('Cc')
+             - ord("\t")
+             - utf8::unicode_to_native(0x0A)  # LINE FEED
+             - utf8::unicode_to_native(0x0B)  # VERTICAL TAB
+             - ord("\f")
+             - utf8::unicode_to_native(0x0D)  # CARRIAGE RETURN
+             - utf8::unicode_to_native(0x85); # NEL
+    }
+}
+
+sub calculate_NChar() {  # Create a Perl extension match table which is the
+                         # same as the Noncharacter_Code_Point property, and
+                         # set $NChar to point to it.  Works on Unicodes
+                         # earlier than ones that explicitly specify NChar
+    return if defined $NChar;
+
+    $NChar = $perl->add_match_table('_Perl_Nchar',
+                                    Perl_Extension => 1,
+                                    Fate => $INTERNAL_ONLY);
+    if (defined (my $off_nchar = property_ref('NChar'))) {
+        $NChar->initialize($off_nchar->table('Y'));
+    }
+    else {
+        $NChar->initialize([ 0xFFFE .. 0xFFFF ]);
+        if ($v_version ge v2.0) {   # First release with these nchars
+            for (my $i = 0x1FFFE; $i <= 0x10FFFE; $i += 0x10000) {
+                $NChar += [ $i .. $i+1 ];
+            }
+        }
+    }
+}
+
+sub handle_compare_versions () {
+    # This fixes things up for the $compare_versions capability, where we
+    # compare Unicode version X with version Y (with Y > X), and we are
+    # running it on the Unicode Data for version Y.
+    #
+    # It works by calculating the code points whose meaning has been specified
+    # after release X, by using the Age property.  The complement of this set
+    # is the set of code points whose meaning is unchanged between the
+    # releases.  This is the set the program restricts itself to.  It includes
+    # everything whose meaning has been specified by the time version X came
+    # along, plus those still unassigned by the time of version Y.  (We will
+    # continue to use the word 'assigned' to mean 'meaning has been
+    # specified', as it's shorter and is accurate in all cases except the
+    # Noncharacter code points.)
+    #
+    # This function is run after all the properties specified by Unicode have
+    # been calculated for release Y.  This makes sure we get all the nuances
+    # of Y's rules.  (It is done before the Perl extensions are calculated, as
+    # those are based entirely on the Unicode ones.)  But doing it after the
+    # Unicode table calculations means we have to fix up the Unicode tables.
+    # We do this by subtracting the code points that have been assigned since
+    # X (which is actually done by ANDing each table of assigned code points
+    # with the set of unchanged code points).  Most Unicode properties are of
+    # the form such that all unassigned code points have a default, grab-bag,
+    # property value which is changed when the code point gets assigned.  For
+    # these, we just remove the changed code points from the table for the
+    # latter property value, and add them back in to the grab-bag one.  A few
+    # other properties are not entirely of this form and have values for some
+    # or all unassigned code points that are not the grab-bag one.  These have
+    # to be handled specially, and are hard-coded in to this routine based on
+    # manual inspection of the Unicode character database.  A list of the
+    # outlier code points is made for each of these properties, and those
+    # outliers are excluded from adding and removing from tables.
+    #
+    # Note that there are glitches when comparing against Unicode 1.1, as some
+    # Hangul syllables in it were later ripped out and eventually replaced
+    # with other things.
+
+    print "Fixing up for version comparison\n" if $verbosity >= $PROGRESS;
+
+    my $after_first_version = "All matching code points were added after "
+                            . "Unicode $string_compare_versions";
+
+    # Calculate the delta as those code points that have been newly assigned
+    # since the first compare version.
+    my $delta = Range_List->new();
+    foreach my $table ($age->tables) {
+        next if $table == $age->table('Unassigned');
+        next if $table->name le $string_compare_versions;
+        $delta += $table;
+    }
+    if ($delta->is_empty) {
+        die ("No changes; perhaps you need a 'DAge.txt' file?");
+    }
+
+    my $unchanged = ~ $delta;
+
+    calculate_Assigned() if ! defined $Assigned;
+    $Assigned &= $unchanged;
+
+    # $Assigned now contains the code points that were assigned as of Unicode
+    # version X.
+
+    # A block is all or nothing.  If nothing is assigned in it, it all goes
+    # back to the No_Block pool; but if even one code point is assigned, the
+    # block is retained.
+    my $no_block = $block->table('No_Block');
+    foreach my $this_block ($block->tables) {
+        next if     $this_block == $no_block
+                ||  ! ($this_block & $Assigned)->is_empty;
+        $this_block->set_fate($SUPPRESSED, $after_first_version);
+        $no_block += $this_block;
+    }
+
+    my @special_delta_properties;   # List of properties that have to be
+                                    # handled specially.
+    my %restricted_delta;           # Keys are the entries in
+                                    # @special_delta_properties;  values
+                                    # are the range list of the code points
+                                    # that behave normally when they get
+                                    # assigned.
+
+    # In the next three properties, the Default Ignorable code points are
+    # outliers.
+    calculate_DI();
+    $DI &= $unchanged;
+
+    push @special_delta_properties, property_ref('_Perl_GCB');
+    $restricted_delta{$special_delta_properties[-1]} = ~ $DI;
+
+    if (defined (my $cwnfkcc = property_ref('Changes_When_NFKC_Casefolded')))
+    {
+        push @special_delta_properties, $cwnfkcc;
+        $restricted_delta{$special_delta_properties[-1]} = ~ $DI;
+    }
+
+    calculate_NChar();      # Non-character code points
+    $NChar &= $unchanged;
+
+    # This may have to be updated from time-to-time to get the most accurate
+    # results.
+    my $default_BC_non_LtoR = Range_List->new(Initialize =>
+                        # These came from the comments in v8.0 DBidiClass.txt
+                                                        [ # AL
+                                                            0x0600 .. 0x07BF,
+                                                            0x08A0 .. 0x08FF,
+                                                            0xFB50 .. 0xFDCF,
+                                                            0xFDF0 .. 0xFDFF,
+                                                            0xFE70 .. 0xFEFF,
+                                                            0x1EE00 .. 0x1EEFF,
+                                                           # R
+                                                            0x0590 .. 0x05FF,
+                                                            0x07C0 .. 0x089F,
+                                                            0xFB1D .. 0xFB4F,
+                                                            0x10800 .. 0x10FFF,
+                                                            0x1E800 .. 0x1EDFF,
+                                                            0x1EF00 .. 0x1EFFF,
+                                                           # ET
+                                                            0x20A0 .. 0x20CF,
+                                                         ]
+                                          );
+    $default_BC_non_LtoR += $DI + $NChar;
+    push @special_delta_properties, property_ref('BidiClass');
+    $restricted_delta{$special_delta_properties[-1]} = ~ $default_BC_non_LtoR;
+
+    if (defined (my $eaw = property_ref('East_Asian_Width'))) {
+
+        my $default_EA_width_W = Range_List->new(Initialize =>
+                                    # From comments in v8.0 EastAsianWidth.txt
+                                                [
+                                                    0x3400 .. 0x4DBF,
+                                                    0x4E00 .. 0x9FFF,
+                                                    0xF900 .. 0xFAFF,
+                                                    0x20000 .. 0x2A6DF,
+                                                    0x2A700 .. 0x2B73F,
+                                                    0x2B740 .. 0x2B81F,
+                                                    0x2B820 .. 0x2CEAF,
+                                                    0x2F800 .. 0x2FA1F,
+                                                    0x20000 .. 0x2FFFD,
+                                                    0x30000 .. 0x3FFFD,
+                                                ]
+                                             );
+        push @special_delta_properties, $eaw;
+        $restricted_delta{$special_delta_properties[-1]}
+                                                       = ~ $default_EA_width_W;
+
+        # Line break came along in the same release as East_Asian_Width, and
+        # the non-grab-bag default set is a superset of the EAW one.
+        if (defined (my $lb = property_ref('Line_Break'))) {
+            my $default_LB_non_XX = Range_List->new(Initialize =>
+                                        # From comments in v8.0 LineBreak.txt
+                                                        [ 0x20A0 .. 0x20CF ]);
+            $default_LB_non_XX += $default_EA_width_W;
+            push @special_delta_properties, $lb;
+            $restricted_delta{$special_delta_properties[-1]}
+                                                        = ~ $default_LB_non_XX;
+        }
+    }
+
+    # Go through every property, skipping those we've already worked on, those
+    # that are immutable, and the perl ones that will be calculated after this
+    # routine has done its fixup.
+    foreach my $property (property_ref('*')) {
+        next if    $property == $perl     # Done later in the program
+                || $property == $block    # Done just above
+                || $property == $DI       # Done just above
+                || $property == $NChar    # Done just above
+
+                   # The next two are invariant across Unicode versions
+                || $property == property_ref('Pattern_Syntax')
+                || $property == property_ref('Pattern_White_Space');
+
+        #  Find the grab-bag value.
+        my $default_map = $property->default_map;
+
+        if (! $property->to_create_match_tables) {
+
+            # Here there aren't any match tables.  So far, all such properties
+            # have a default map, and don't require special handling.  Just
+            # change each newly assigned code point back to the default map,
+            # as if they were unassigned.
+            foreach my $range ($delta->ranges) {
+                $property->add_map($range->start,
+                                $range->end,
+                                $default_map,
+                                Replace => $UNCONDITIONALLY);
+            }
+        }
+        else {  # Here there are match tables.  Find the one (if any) for the
+                # grab-bag value that unassigned code points go to.
+            my $default_table;
+            if (defined $default_map) {
+                $default_table = $property->table($default_map);
+            }
+
+            # If some code points don't go back to the the grab-bag when they
+            # are considered unassigned, exclude them from the list that does
+            # that.
+            my $this_delta = $delta;
+            my $this_unchanged = $unchanged;
+            if (grep { $_ == $property } @special_delta_properties) {
+                $this_delta = $delta & $restricted_delta{$property};
+                $this_unchanged = ~ $this_delta;
+            }
+
+            # Fix up each match table for this property.
+            foreach my $table ($property->tables) {
+                if (defined $default_table && $table == $default_table) {
+
+                    # The code points assigned after release X (the ones we
+                    # are excluding in this routine) go back on to the default
+                    # (grab-bag) table.  However, some of these tables don't
+                    # actually exist, but are specified solely by the other
+                    # tables.  (In a binary property, we don't need to
+                    # actually have an 'N' table, as it's just the complement
+                    # of the 'Y' table.)  Such tables will be locked, so just
+                    # skip those.
+                    $table += $this_delta unless $table->locked;
+                }
+                else {
+
+                    # Here the table is not for the default value.  We need to
+                    # subtract the code points we are ignoring for this
+                    # comparison (the deltas) from it.  But if the table
+                    # started out with nothing, no need to exclude anything,
+                    # and want to skip it here anyway, so it gets listed
+                    # properly in the pod.
+                    next if $table->is_empty;
+
+                    # Save the deltas for later, before we do the subtraction
+                    my $deltas = $table & $this_delta;
+
+                    $table &= $this_unchanged;
+
+                    # Suppress the table if the subtraction left it with
+                    # nothing in it
+                    if ($table->is_empty) {
+                        if ($property->type == $BINARY) {
+                            push @tables_that_may_be_empty, 
$table->complete_name;
+                        }
+                        else {
+                            $table->set_fate($SUPPRESSED, 
$after_first_version);
+                        }
+                    }
+
+                    # Now we add the removed code points to the property's
+                    # map, as they should now map to the grab-bag default
+                    # property (which they did in the first comparison
+                    # version).  But we don't have to do this if the map is
+                    # only for internal use.
+                    if (defined $default_map && $property->to_output_map) {
+
+                        # The gc property has pseudo property values whose 
names
+                        # have length 1.  These are the union of all the
+                        # property values whose name is longer than 1 and
+                        # whose first letter is all the same.  The replacement
+                        # is done once for the longer-named tables.
+                        next if $property == $gc && length $table->name == 1;
+
+                        foreach my $range ($deltas->ranges) {
+                            $property->add_map($range->start,
+                                            $range->end,
+                                            $default_map,
+                                            Replace => $UNCONDITIONALLY);
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    # The above code doesn't work on 'gc=C', as it is a superset of the default
+    # ('Cn') table.  It's easiest to just special case it here.
+    my $C = $gc->table('C');
+    $C += $gc->table('Cn');
+
+    return;
+}
+
 sub compile_perl() {
     # Create perl-defined tables.  Almost all are part of the pseudo-property
     # named 'perl' internally to this program.  Many of these are recommended
@@ -14054,7 +14349,8 @@ sub compile_perl() {
                             );
 
     my $perl_surrogate = $perl->add_match_table('_Perl_Surrogate');
-    if (defined (my $Cs = $gc->table('Cs'))) {
+    my $Cs = $gc->table('Cs');
+    if (defined $Cs && ! $Cs->is_empty) {
         $perl_surrogate += $Cs;
     }
     else {
@@ -14541,7 +14837,7 @@ END
     ));
 
     # Construct the Present_In property from the Age property.
-    if (-e 'DAge.txt' && defined (my $age = property_ref('Age'))) {
+    if (-e 'DAge.txt' && defined $age) {
         my $default_map = $age->default_map;
         my $in = Property->new('In',
                                 Default_Map => $default_map,
@@ -14691,6 +14987,9 @@ END
         if ($v_version ge v2.0) {
             $quotemeta += $gc->table('Cf')
                        +  $gc->table('Cs');
+
+            # These are above the Unicode version 1 max
+            $quotemeta->add_range(0xE0000, 0xE0FFF);
         }
         $quotemeta += $gc->table('Cc')
                     - $Space;
@@ -14700,23 +14999,12 @@ END
                                                    0xFFF0 .. 0xFFFB,
                                                    0xE0000 .. 0xE0FFF,
                                                   ]);
-        $quotemeta += $temp & $Assigned;
+        $quotemeta += $temp;
     }
+    calculate_DI();
+    $quotemeta += $DI;
 
-    my $nchar = $perl->add_match_table('_Perl_Nchar',
-                                       Perl_Extension => 1,
-                                       Fate => $INTERNAL_ONLY);
-    if (defined (my $off_nchar = property_ref('Nchar'))) {
-        $nchar->initialize($off_nchar->table('Y'));
-    }
-    else {
-        $nchar->initialize([ 0xFFFE .. 0xFFFF ]);
-        if ($v_version ge v2.0) {   # First release with these nchars
-            for (my $i = 0x1FFFE; $i <= 0x10FFFE; $i += 0x10000) {
-                $nchar += [ $i .. $i+1 ];
-            }
-        }
-    }
+    calculate_NChar();
 
     # Finished creating all the perl properties.  All non-internal non-string
     # ones have a synonym of 'Is_' prefixed.  (Internal properties begin with
@@ -14743,9 +15031,7 @@ END
         # can give different annotations for each.
         $unassigned_sans_noncharacters = Range_List->new(
                                     Initialize => $gc->table('Unassigned'));
-        if (defined (my $nonchars = property_ref('Noncharacter_Code_Point'))) {
-            $unassigned_sans_noncharacters &= $nonchars->table('N');
-        }
+        $unassigned_sans_noncharacters &= (~ $NChar);
 
         for (my $i = 0; $i <= $MAX_UNICODE_CODEPOINT + 1; $i++ ) {
             $i = populate_char_info($i);    # Note sets $i so may cause skips
@@ -16219,7 +16505,7 @@ To change this file, edit $0 instead.
 
 =head1 NAME
 
-$pod_file - Index of Unicode Version $string_version character properties in 
Perl
+$pod_file - Index of Unicode Version $unicode_version character properties in 
Perl
 
 =head1 DESCRIPTION
 
@@ -18307,15 +18593,11 @@ my $Validation_Documentation = "Documentation of 
validation Tests";
 # This is a list of the input files and how to handle them.  The files are
 # processed in their order in this list.  Some reordering is possible if
 # desired, but the PropertyAliases and PropValueAliases files should be first,
-# and the extracted before the others except DAge.txt (as data in an extracted
-# file can be over-ridden by the non-extracted.  Some other files depend on
-# data derived from an earlier file, like UnicodeData requires data from Jamo,
-# and the case changing and folding requires data from Unicode.  Mostly, it is
-# safest to order by first version releases in (except the Jamo).  DAge.txt is
-# read before the extracted ones because of the rarely used feature
-# $compare_versions.  In the unlikely event that there were ever an extracted
-# file that contained the Age property information, it would have to go in
-# front of DAge.
+# and the extracted before the others (as data in an extracted file can be
+# over-ridden by the non-extracted.  Some other files depend on data derived
+# from an earlier file, like UnicodeData requires data from Jamo, and the case
+# changing and folding requires data from Unicode.  Mostly, it is safest to
+# order by first version releases in (except the Jamo).
 #
 # The version strings allow the program to know whether to expect a file or
 # not, but if a file exists in the directory, it will be processed, even if it
@@ -18337,10 +18619,6 @@ my @input_file_objects = (
                      Has_Missings_Defaults => $NOT_IGNORED,
                      Required_Even_in_Debug_Skip => 1,
                     ),
-    Input_file->new('DAge.txt', v3.2.0,
-                    Has_Missings_Defaults => $NOT_IGNORED,
-                    Property => 'Age'
-                   ),
     Input_file->new("${EXTRACTED}DGeneralCategory.txt", v3.1.0,
                     Property => 'General_Category',
                    ),
@@ -18562,6 +18840,10 @@ my @input_file_objects = (
                     Withdrawn => v5.1,
                     Skip => $Documentation,
                    ),
+    Input_file->new('DAge.txt', v3.2.0,
+                    Has_Missings_Defaults => $NOT_IGNORED,
+                    Property => 'Age'
+                   ),
     Input_file->new('HangulSyllableType.txt', v4.0,
                     Has_Missings_Defaults => $NOT_IGNORED,
                     Early => [ \&generate_hst, 'Hangul_Syllable_Type' ],
@@ -18799,7 +19081,7 @@ if (@missing_early_files) {
 
 The compilation cannot be completed because one or more required input files,
 listed below, are missing.  This is because you are compiling Unicode version
-$string_version, which predates the existence of these file(s).  To fully
+$unicode_version, which predates the existence of these file(s).  To fully
 function, perl needs the data that these files would have contained if they
 had been in this release.  To work around this, create copies of later
 versions of the missing files in the directory containing '$0'.  (Perl will
@@ -19040,6 +19322,11 @@ foreach my $file (@input_file_objects) {
 print "Finishing processing Unicode properties\n" if $verbosity >= $PROGRESS;
 finish_Unicode();
 
+# For the very specialized case of comparing two Unicode versions...
+if (DEBUG && $compare_versions) {
+    handle_compare_versions();
+}
+
 print "Compiling Perl properties\n" if $verbosity >= $PROGRESS;
 compile_perl();
 
diff --git a/regcharclass.h b/regcharclass.h
index f4b6298..72e25fe 100644
--- a/regcharclass.h
+++ b/regcharclass.h
@@ -2514,7 +2514,7 @@
  * 1a0687fb9c6c4567e853913549df0944fe40821279a3e9cdaa6ab8679bc286fd 
lib/unicore/extracted/DLineBreak.txt
  * 40bcfed3ca727c19e1331f6c33806231d5f7eeeabd2e6a9e06a3740c85d0c250 
lib/unicore/extracted/DNumType.txt
  * a18d502bad39d527ac5586d7bc93e29f565859e3bcc24ada627eff606d6f5fed 
lib/unicore/extracted/DNumValues.txt
- * ad739a46951b5f46396074b0682a2cfeed24b633a742a8e1aa0e337f69ef8b1c 
lib/unicore/mktables
+ * d5895407f73f1bcb0d7ad39e955c0e88a9145e401e868572849eeb134e669269 
lib/unicore/mktables
  * 462c9aaa608fb2014cd9649af1c5c009485c60b9c8b15b89401fdc10cf6161c6 
lib/unicore/version
  * c6884f4d629f04d1316f3476cb1050b6a1b98ca30c903262955d4eae337c6b1e 
regen/charset_translations.pl
  * d9c04ac46bdd81bb3e26519f2b8eb6242cb12337205add3f7cf092b0c58dccc4 
regen/regcharclass.pl

--
Perl5 Master Repository

[perl.git] branch blead, updated. v5.23.2-13-g356ad74

Reply via email to