In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/c707cf8e56262e82a832f4b1eceb109bee32ec3a?hp=65484cb9211d2894650050b07ff45b8d35811520>
- Log -----------------------------------------------------------------
commit c707cf8e56262e82a832f4b1eceb109bee32ec3a
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 19:10:06 2011 -0700

    UCD.pm: All code points are in some block

    Code points that are not in a block are considered to be in the
    pseudo-block 'No_Block' by the Unicode standard; so change to do that
    instead of 'undef'

M	lib/Unicode/UCD.pm
M	lib/Unicode/UCD.t

commit 8079ad8217fc56d7ca13c7f6bae25f75e0762e6a
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 19:02:37 2011 -0700

    UCD.pm: All code points have a script

    Unassigned code points have the script 'Unknown'; not undef

M	lib/Unicode/UCD.pm
M	lib/Unicode/UCD.t
M	pod/perldelta.pod

commit f200dd127985c7f429f992446014c65eeff2db20
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 19:01:34 2011 -0700

    UCD.pm: Nits in pod

M	lib/Unicode/UCD.pm

commit 4f66642efa00c6bb7fdc4e2f096ddc55e6206cb6
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 18:42:30 2011 -0700

    UCD.pm: Fix typos in pod

M	lib/Unicode/UCD.pm

commit 05dbc6f80f8f2d5774f53874803f5a20450bbe82
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 18:33:18 2011 -0700

    UCD.pm: Remove reliance on UnicodeData.txt

    In doing so, there were a number of bug fixes made, as it now relies on
    files processed by mktables, which has intelligence to fix a number of
    problems with UnicodeData.txt.

    This is essentially a rewrite of charinfo().  It previously had
    hard-coded the ranges in UnicodeData.txt, instead of examining the
    file to see what was there.  This had not been updated for some time,
    and was out-of-date, with the result that the newer ranges (all CJK)
    were quite wrong.  The new code does not have such reliance, and so
    new versions of Unicode should not break this, like they previously
    would

    This may be slower than what was previously there, as it reads several
    smaller files instead of one very large one.

    But the principal reason to do this work was to save disk space.  It
    was previously thought that the function could continue to use
    UnicodeData.txt if it exists on the machine, but this would have
    required fixing all the bugs that this automatically fixes by using
    the processed files.

M	lib/Unicode/UCD.pm
M	lib/Unicode/UCD.t
M	pod/perldelta.pod

commit 36c2430c54431c750134fb5add2327486301d66f
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 18:02:03 2011 -0700

    UCD.pm: Use subclassed warnings

    5.14 subclasses some UTF8 warnings, so that they can be turned off
    more precisely.

M	lib/Unicode/UCD.pm

commit f3d50ac98736b7b6864856ac20359ed14bee064f
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 18:00:08 2011 -0700

    UCD.pm: Use traditional casing for script names

    For some reason UCD.pm has lowercased the first letters of the
    non-first word in script names.  For backwards compatibility, continue
    to do so.

M	lib/Unicode/UCD.pm

commit eb01e50bb8634515813de5d0b3d41067d81d8bf7
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 17:57:29 2011 -0700

    mktables: Write Unicode_1_Name table for UCD.pm

M	lib/unicore/mktables

commit fcf1973cc859c6bebd1a7cdf8740f690f33617a1
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 17:48:04 2011 -0700

    mktables: Add override for map tables output

    This adds a hash so can more precisely control which map tables get
    output and which are documented.  The hash is populated to suppress
    some messages and some tables that are redundant.

M	lib/unicore/mktables

commit 6c0259ad7f0bfd1649f9df7e74facc87075433a7
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 17:31:15 2011 -0700

    mktables: White-space only

    The previous patch introduced a closure, and this patch indents the
    code in that closure.

M	lib/unicore/mktables

commit d3fed3ddb4fba47f1acee223d6c48283d9e2133f
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 17:20:37 2011 -0700

    mktables: Add tables of just simple case foldings

    This adds three tables for lc, uc, and title, which are the simple
    mappings that are overridden by full mappings.  These are quite tiny,
    and will be used by UCD.pm to avoid using UnicodeData.txt

M	lib/unicore/mktables

commit 9e92970c614d4054e0dc8b57cbf767b673b2774b
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 16:53:20 2011 -0700

    UCD.t: Add test for non-Unicode code point

M	lib/Unicode/UCD.t

commit 4fa6d9d34c528acc4258813903a1b5f60d507034
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 16:48:47 2011 -0700

    UCD.pm: remove no longer used variable

M	lib/Unicode/UCD.pm

commit 272501f63b387e4545389d735ab1cec6cbb76cb6
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 16:27:31 2011 -0700

    mktables: Move some definitions to earlier

M	lib/unicore/mktables

commit a441186611883aac7143f037e8e10d314482028c
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 16:20:24 2011 -0700

    UCD.t: Fix a test description

M	lib/Unicode/UCD.t

commit 06bba7d54b7eb05ccfa474032017480e14b18374
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 3 16:17:55 2011 -0700

    UCD.pm: Nits in pod and comment

M	lib/Unicode/UCD.pm
-----------------------------------------------------------------------

Summary of changes:
 lib/Unicode/UCD.pm   | 306 +++++++++++++++++++++++++++----------------------
 lib/Unicode/UCD.t    |  21 ++--
 lib/unicore/mktables | 237 ++++++++++++++++++++++++---------------
 pod/perldelta.pod    |  45 ++++++++
 4 files changed, 372 insertions(+), 237 deletions(-)

diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index d0eead3..30acd50 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -2,7 +2,9 @@ package Unicode::UCD;
 
 use strict;
 use warnings;
+no warnings 'surrogate'; # surrogates can be inputs to this
 
 use charnames ();
+use Unicode::Normalize
qw(getCombinClass NFKD); our $VERSION = '0.32'; @@ -84,15 +86,12 @@ a decimal or a hexadecimal scalar designating a Unicode code point, or C<U+> followed by hexadecimals designating a Unicode code point. In other words, if you want a code point to be interpreted as a hexadecimal number, you must prefix it with either C<0x> or C<U+>, because a string like e.g. C<123> will be -interpreted as a decimal code point. Also note that Unicode is B<not> limited -to 16 bits (the number of Unicode code points is open-ended, in theory -unlimited): you may have more than 4 hexdigits. +interpreted as a decimal code point. Note that the largest code point in +Unicode is U+10FFFF. =cut -my $UNICODEFH; my $BLOCKSFH; my $VERSIONFH; -my $COMPEXCLFH; my $CASEFOLDFH; my $CASESPECFH; my $NAMEDSEQFH; @@ -175,6 +174,9 @@ I<code>. Each has at least four hexdigits. The codes may be preceded by a word enclosed in angle brackets then a space, like C<E<lt>compatE<gt> >, giving the type of decomposition +This decomposition may be an intermediate one whose components are also +decomposable. Use L<Unicode::Normalize> to get the final decomposition. + =item B<decimal> if I<code> is a decimal digit this is its integer numeric value @@ -204,7 +206,8 @@ As of Unicode 6.0, this is always empty. =item B<upper> -is empty if there is no single code point uppercase mapping for I<code>; +is empty if there is no single code point uppercase mapping for I<code> +(its uppercase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. (L</casespec()> should be used in addition to B<charinfo()> for case mappings when the calling program can cope with multiple code point @@ -212,7 +215,8 @@ mappings.) =item B<lower> -is empty if there is no single code point lowercase mapping for I<code>; +is empty if there is no single code point lowercase mapping for I<code> +(its lowercase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. 
(L</casespec()> should be used in addition to B<charinfo()> for case mappings when the calling program can cope with multiple code point @@ -220,7 +224,8 @@ mappings.) =item B<title> -is empty if there is no single code point titlecase mapping for I<code>; +is empty if there is no single code point titlecase mapping for I<code> +(its titlecase mapping is itself); otherwise it is that mapping expressed as at least four hexdigits. (L</casespec()> should be used in addition to B<charinfo()> for case mappings when the calling program can cope with multiple code point @@ -258,134 +263,152 @@ sub _getcode { return; } -# Lingua::KO::Hangul::Util not part of the standard distribution -# but it will be used if available. - -eval { require Lingua::KO::Hangul::Util }; -my $hasHangulUtil = ! $@; -if ($hasHangulUtil) { - Lingua::KO::Hangul::Util->import(); +# Populated by _num. Converts real number back to input rational +my %real_to_rational; + +# To store the contents of files found on disk. +my @BIDIS; +my @CATEGORIES; +my @DECOMPOSITIONS; +my @NUMERIC_TYPES; +my @SIMPLE_LOWER; +my @SIMPLE_TITLE; +my @SIMPLE_UPPER; +my @UNICODE_1_NAMES; + +sub _charinfo_case { + + # Returns the value to set into one of the case fields in the charinfo + # structure. + # $char is the character, + # $cased is the case-changed character + # $file is the file in lib/unicore/To/$file that contains the data + # needed for this, in the form that _search() understands. + # $array_ref points to the array holding the contents of $file. It will + # be populated if empty. + # By using the 'uc', etc. functions, we avoid loading more files into + # memory except for those rare cases where the simple casing (which has + # been what charinfo() has always returned, is different than the full + # casing. 
+ my ($char, $cased, $file, $array_ref) = @_; + + return "" if $cased eq $char; + + return sprintf("%04X", ord $cased) if length($cased) == 1; + + @$array_ref =_read_table("unicore/To/$file") unless @$array_ref; + return _search($array_ref, 0, $#$array_ref, ord $char) // ""; } -sub hangul_decomp { # internal: called from charinfo - if ($hasHangulUtil) { - my @tmp = decomposeHangul(shift); - return sprintf("%04X %04X", @tmp) if @tmp == 2; - return sprintf("%04X %04X %04X", @tmp) if @tmp == 3; - } - return; -} - -sub hangul_charname { # internal: called from charinfo - return sprintf("HANGUL SYLLABLE-%04X", shift); -} - -sub han_charname { # internal: called from charinfo - return sprintf("CJK UNIFIED IDEOGRAPH-%04X", shift); -} +sub charinfo { -# Overwritten by data in file -my %first_last = ( - 'CJK Ideograph Extension A' => [ 0x3400, 0x4DB5 ], - 'CJK Ideograph' => [ 0x4E00, 0x9FA5 ], - 'CJK Ideograph Extension B' => [ 0x20000, 0x2A6D6 ], -); - -get_charinfo_ranges(); - -sub get_charinfo_ranges { - my @blocks = keys %first_last; - - my $fh; - openunicode( \$fh, 'UnicodeData.txt' ); - if( defined $fh ){ - while( my $line = <$fh> ){ - next unless $line =~ /(?:First|Last)/; - if( grep{ $line =~ /[^;]+;<$_\s*,\s*(?:First|Last)>/ }@blocks ){ - my ($number,$block,$type); - ($number,$block) = split /;/, $line; - $block =~ s/<|>//g; - ($block,$type) = split /, /, $block; - my $index = $type eq 'First' ? 0 : 1; - $first_last{ $block }->[$index] = hex $number; - } - } - } -} + # This function has traditionally mimicked what is in UnicodeData.txt, + # warts and all. This is a re-write that avoids UnicodeData.txt so that + # it can be removed to save disk space. Instead, this assembles + # information gotten by other methods that get data from various other + # files. It uses charnames to get the character name; and various + # mktables tables. 
-my @CharinfoRanges = ( -# block name -# [ first, last, coderef to name, coderef to decompose ], -# CJK Ideographs Extension A - [ @{ $first_last{'CJK Ideograph Extension A'} }, \&han_charname, undef ], -# CJK Ideographs - [ @{ $first_last{'CJK Ideograph'} }, \&han_charname, undef ], -# Hangul Syllables - [ 0xAC00, 0xD7A3, $hasHangulUtil ? \&getHangulName : \&hangul_charname, \&hangul_decomp ], -# Non-Private Use High Surrogates - [ 0xD800, 0xDB7F, undef, undef ], -# Private Use High Surrogates - [ 0xDB80, 0xDBFF, undef, undef ], -# Low Surrogates - [ 0xDC00, 0xDFFF, undef, undef ], -# The Private Use Area - [ 0xE000, 0xF8FF, undef, undef ], -# CJK Ideographs Extension B - [ @{ $first_last{'CJK Ideograph Extension B'} }, \&han_charname, undef ], -# Plane 15 Private Use Area - [ 0xF0000, 0xFFFFD, undef, undef ], -# Plane 16 Private Use Area - [ 0x100000, 0x10FFFD, undef, undef ], -); + use feature 'unicode_strings'; -sub charinfo { my $arg = shift; my $code = _getcode($arg); - croak __PACKAGE__, "::charinfo: unknown code '$arg'" - unless defined $code; - my $hexk = sprintf("%06X", $code); - my($rcode,$rname,$rdec); - foreach my $range (@CharinfoRanges){ - if ($range->[0] <= $code && $code <= $range->[1]) { - $rcode = $hexk; - $rcode =~ s/^0+//; - $rcode = sprintf("%04X", hex($rcode)); - $rname = $range->[2] ? $range->[2]->($code) : ''; - $rdec = $range->[3] ? $range->[3]->($code) : ''; - $hexk = sprintf("%06X", $range->[0]); # replace by the first - last; - } + croak __PACKAGE__, "::charinfo: unknown code '$arg'" unless defined $code; + + # Non-unicode implies undef. 
+ return if $code > 0x10FFFF; + + my %prop; + my $char = chr($code); + + @CATEGORIES =_read_table("unicore/To/Gc.pl") unless @CATEGORIES; + $prop{'category'} = _search(\@CATEGORIES, 0, $#CATEGORIES, $code) + // $utf8::SwashInfo{'ToGc'}{'missing'}; + + return if $prop{'category'} eq 'Cn'; # Unassigned code points are undef + + $prop{'code'} = sprintf "%04X", $code; + $prop{'name'} = ($char =~ /\p{Cntrl}/) ? '<control>' + : (charnames::viacode($code) // ""); + + $prop{'combining'} = getCombinClass($code); + + @BIDIS =_read_table("unicore/To/Bc.pl") unless @BIDIS; + $prop{'bidi'} = _search(\@BIDIS, 0, $#BIDIS, $code) + // $utf8::SwashInfo{'ToBc'}{'missing'}; + + # For most code points, we can just read in "unicore/Decomposition.pl", as + # its contents are exactly what should be output. But that file doesn't + # contain the data for the Hangul syllable decompositions, which can be + # algorithmically computed, and NFKD() does that, so we call NFKD() for + # those. We can't use NFKD() for everything, as it does a complete + # recursive decomposition, and what this function has always done is to + # return what's in UnicodeData.txt which doesn't have the recursivenss + # specified. + # in the decomposition types. No decomposition implies an empty field; + # otherwise, all but "Canonical" imply a compatible decomposition, and + # the type is prefixed to that, as it is in UnicodeData.txt + if ($char =~ /\p{Block=Hangul_Syllables}/) { + # The code points of the decomposition are output in standard Unicode + # hex format, separated by blanks. 
+ $prop{'decomposition'} = join " ", map { sprintf("%04X", $_)} + unpack "U*", NFKD($char); } - openunicode(\$UNICODEFH, "UnicodeData.txt"); - if (defined $UNICODEFH) { - use Search::Dict 1.02; - if (look($UNICODEFH, "$hexk;", { xfrm => sub { $_[0] =~ /^([^;]+);(.+)/; sprintf "%06X;$2", hex($1) } } ) >= 0) { - my $line = <$UNICODEFH>; - return unless defined $line; - chomp $line; - my %prop; - @prop{qw( - code name category - combining bidi decomposition - decimal digit numeric - mirrored unicode10 comment - upper lower title - )} = split(/;/, $line, -1); - $hexk =~ s/^0+//; - $hexk = sprintf("%04X", hex($hexk)); - if ($prop{code} eq $hexk) { - $prop{block} = charblock($code); - $prop{script} = charscript($code); - if(defined $rname){ - $prop{code} = $rcode; - $prop{name} = $rname; - $prop{decomposition} = $rdec; - } - return \%prop; - } - } + else { + @DECOMPOSITIONS = _read_table("unicore/Decomposition.pl") + unless @DECOMPOSITIONS; + $prop{'decomposition'} = _search(\@DECOMPOSITIONS, 0, $#DECOMPOSITIONS, + $code) // ""; } - return; + + # Can use num() to get the numeric values, if any. + if (! defined (my $value = num($char))) { + $prop{'decimal'} = $prop{'digit'} = $prop{'numeric'} = ""; + } + else { + if ($char =~ /\d/) { + $prop{'decimal'} = $prop{'digit'} = $prop{'numeric'} = $value; + } + else { + + # For non-decimal-digits, we have to read in the Numeric type + # to distinguish them. It is not just a matter of integer vs. + # rational, as some whole number values are not considered digits, + # e.g., TAMIL NUMBER TEN. + $prop{'decimal'} = ""; + + @NUMERIC_TYPES =_read_table("unicore/To/Nt.pl") + unless @NUMERIC_TYPES; + if ((_search(\@NUMERIC_TYPES, 0, $#NUMERIC_TYPES, $code) // "") + eq 'Digit') + { + $prop{'digit'} = $prop{'numeric'} = $value; + } + else { + $prop{'digit'} = ""; + $prop{'numeric'} = $real_to_rational{$value} // $value; + } + } + } + + $prop{'mirrored'} = ($char =~ /\p{Bidi_Mirrored}/) ? 
'Y' : 'N'; + + @UNICODE_1_NAMES =_read_table("unicore/To/Na1.pl") unless @UNICODE_1_NAMES; + $prop{'unicode10'} = _search(\@UNICODE_1_NAMES, 0, $#UNICODE_1_NAMES, $code) + // ""; + + # This is true starting in 6.0, but, num() also requires 6.0, so + # don't need to test for version again here. + $prop{'comment'} = ""; + + $prop{'upper'} = _charinfo_case($char, uc $char, '_suc.pl', \@SIMPLE_UPPER); + $prop{'lower'} = _charinfo_case($char, lc $char, '_slc.pl', \@SIMPLE_LOWER); + $prop{'title'} = _charinfo_case($char, ucfirst $char, '_stc.pl', + \@SIMPLE_TITLE); + + $prop{block} = charblock($code); + $prop{script} = charscript($code); + return \%prop; } sub _search { # Binary search in a [[lo,hi,prop],[...],...] table. @@ -480,6 +503,9 @@ my @BLOCKS; my %BLOCKS; sub _charblocks { + + # Can't read from the mktables table because it loses the hyphens in the + # original. unless (@BLOCKS) { if (openunicode(\$BLOCKSFH, "Blocks.txt")) { local $_; @@ -504,13 +530,12 @@ sub charblock { my $code = _getcode($arg); if (defined $code) { - _search(\@BLOCKS, 0, $#BLOCKS, $code); - } else { - if (exists $BLOCKS{$arg}) { - return dclone $BLOCKS{$arg}; - } else { - return; - } + my $result = _search(\@BLOCKS, 0, $#BLOCKS, $code); + return $result if defined $result; + return 'No_Block'; + } + elsif (exists $BLOCKS{$arg}) { + return dclone $BLOCKS{$arg}; } } @@ -545,6 +570,7 @@ my %SCRIPTS; sub _charscripts { @SCRIPTS =_read_table("unicore/To/Sc.pl") unless @SCRIPTS; foreach my $entry (@SCRIPTS) { + $entry->[2] =~ s/(_\w)/\L$1/g; # Preserve old-style casing push @{$SCRIPTS{$entry->[2]}}, $entry; } } @@ -559,7 +585,7 @@ sub charscript { if (defined $code) { my $result = _search(\@SCRIPTS, 0, $#SCRIPTS, $code); return $result if defined $result; - #return $utf8::SwashInfo{'ToSc'}{'missing'}; + return $utf8::SwashInfo{'ToSc'}{'missing'}; } elsif (exists $SCRIPTS{$arg}) { return dclone $SCRIPTS{$arg}; } @@ -606,7 +632,7 @@ sub charscripts { =head2 B<charinrange()> -In addition to using 
the C<\p{In...}> and C<\P{In...}> constructs, you +In addition to using the C<\p{Blk=...}> and C<\P{Blk=...}> constructs, you can also test whether a code point is in the I<range> as returned by L</charblock()> and L</charscript()> or as the values of the hash returned by L</charblocks()> and L</charscripts()> by using charinrange(): @@ -767,7 +793,7 @@ sub compexcl { croak __PACKAGE__, "::compexcl: unknown code '$arg'" unless defined $code; - no warnings "utf8"; # So works on surrogates and non-Unicode code points + no warnings "non_unicode"; # So works on non-Unicode code points return chr($code) =~ /\p{Composition_Exclusion}/; } @@ -1218,9 +1244,11 @@ sub _numeric { foreach my $entry (@numbers) { my ($start, $end, $value) = @$entry; - # If value contains a slash, convert to decimal + # If value contains a slash, convert to decimal, add a reverse hash + # used by charinfo. if ((my @rational = split /\//, $value) == 2) { my $real = $rational[0] / $rational[1]; + $real_to_rational{$real} = $value; $value = $real; } @@ -1366,7 +1394,7 @@ For scripts see UTR #24: L<http://www.unicode.org/unicode/reports/tr24/> Scripts are matched with the regular-expression construct C<\p{...}> (e.g. C<\p{Tibetan}> matches characters of the Tibetan script), -while C<\p{In...}> is used for blocks (e.g. C<\p{InTibetan}> matches +while C<\p{Blk=...}> is used for blocks (e.g. C<\p{Blk=Tibetan}> matches any of the 256 code points in the Tibetan block). diff --git a/lib/Unicode/UCD.t b/lib/Unicode/UCD.t index 933fbbf..9c57f38 100644 --- a/lib/Unicode/UCD.t +++ b/lib/Unicode/UCD.t @@ -17,12 +17,14 @@ use strict; use Unicode::UCD; use Test::More; -BEGIN { plan tests => 269 }; +BEGIN { plan tests => 271 }; use Unicode::UCD 'charinfo'; my $charinfo; +is(charinfo(0x110000), undef, "Verify charinfo() of non-unicode is undef"); + $charinfo = charinfo(0); # Null is often problematic, so test it. 
 is($charinfo->{code}, '0000', '<control>');
@@ -131,12 +133,12 @@ is($charinfo->{script}, 'Hebrew');
 
 $charinfo = charinfo(0xAC00);
 
-is($charinfo->{code}, 'AC00', 'HANGUL SYLLABLE-AC00');
-is($charinfo->{name}, 'HANGUL SYLLABLE-AC00');
+is($charinfo->{code}, 'AC00', 'HANGUL SYLLABLE U+AC00');
+is($charinfo->{name}, 'HANGUL SYLLABLE GA');
 is($charinfo->{category}, 'Lo');
 is($charinfo->{combining}, '0');
 is($charinfo->{bidi}, 'L');
-is($charinfo->{decomposition}, undef);
+is($charinfo->{decomposition}, '1100 1161');
 is($charinfo->{decimal}, '');
 is($charinfo->{digit}, '');
 is($charinfo->{numeric}, '');
@@ -153,12 +155,12 @@ is($charinfo->{script}, 'Hangul');
 
 $charinfo = charinfo(0xAE00);
 
-is($charinfo->{code}, 'AE00', 'HANGUL SYLLABLE-AE00');
-is($charinfo->{name}, 'HANGUL SYLLABLE-AE00');
+is($charinfo->{code}, 'AE00', 'HANGUL SYLLABLE U+AE00');
+is($charinfo->{name}, 'HANGUL SYLLABLE GEUL');
 is($charinfo->{category}, 'Lo');
 is($charinfo->{combining}, '0');
 is($charinfo->{bidi}, 'L');
-is($charinfo->{decomposition}, undef);
+is($charinfo->{decomposition}, "1100 1173 11AF");
 is($charinfo->{decimal}, '');
 is($charinfo->{digit}, '');
 is($charinfo->{numeric}, '');
@@ -216,7 +218,8 @@ use Unicode::UCD qw(charblock charscript);
 
 # 0x0590 is in the Hebrew block but unused.
 is(charblock(0x590), 'Hebrew', '0x0590 - Hebrew unused charblock');
-is(charscript(0x590), undef, '0x0590 - Hebrew unused charscript');
+is(charscript(0x590), 'Unknown', '0x0590 - Hebrew unused charscript');
+is(charblock(0x1FFFF), 'No_Block', '0x1FFFF - unused charblock');
 
 $charinfo = charinfo(0xbe);
 
@@ -455,7 +458,7 @@ is(num("0"), 0, 'Verify num("0") == 0');
 is(num("98765"), 98765, 'Verify num("98765") == 98765');
 ok(! defined num("98765\N{FULLWIDTH DIGIT FOUR}"), 'Verify num("98765\N{FULLWIDTH DIGIT FOUR}") isnt defined');
 is(num("\N{NEW TAI LUE DIGIT TWO}\N{NEW TAI LUE DIGIT ONE}"), 21, 'Verify \N{NEW TAI LUE DIGIT TWO}\N{NEW TAI LUE DIGIT ONE}" == 21');
-ok(!
defined num("\N{NEW TAI LUE DIGIT TWO}\N{NEW TAI LUE THAM DIGIT ONE}"), 'Verify \N{NEW TAI LUE DIGIT TWO}\N{NEW TAI LUE DIGIT ONE}" isnt defined'); +ok(! defined num("\N{NEW TAI LUE DIGIT TWO}\N{NEW TAI LUE THAM DIGIT ONE}"), 'Verify \N{NEW TAI LUE DIGIT TWO}\N{NEW TAI LUE THAM DIGIT ONE}" isnt defined'); is(num("\N{CHAM DIGIT ZERO}\N{CHAM DIGIT THREE}"), 3, 'Verify num("\N{CHAM DIGIT ZERO}\N{CHAM DIGIT THREE}") == 3'); ok(! defined num("\N{CHAM DIGIT ZERO}\N{JAVANESE DIGIT NINE}"), 'Verify num("\N{CHAM DIGIT ZERO}\N{JAVANESE DIGIT NINE}") isnt defined'); is(num("\N{SUPERSCRIPT TWO}"), 2, 'Verify num("\N{SUPERSCRIPT TWO} == 2'); diff --git a/lib/unicore/mktables b/lib/unicore/mktables index c0ad2f1..90bdccc 100644 --- a/lib/unicore/mktables +++ b/lib/unicore/mktables @@ -839,6 +839,26 @@ if ($v_version ge v5.2.0) { } } +# Enum values for to_output_map() method in the Map_Table package. +my $EXTERNAL_MAP = 1; +my $INTERNAL_MAP = 2; + +# To override computed values for writing the map tables for these properties. +# The default for enum map tables is to write them out, so that the Unicode +# .txt files can be removed, but all the data to compute any property value +# for any code point is available in a more compact form. +my %global_to_output_map = ( + # Needed by UCD.pm, but don't want to publicize that it exists, so won't + # get stuck supporting it if things change. Sinc it is a STRING property, + # it normally would be listed in the pod, but INTERNAL_MAP suppresses + # that. + Unicode_1_Name => $INTERNAL_MAP, + + Present_In => 0, # Suppress, as easily computed from Age + Canonical_Combining_Class => 0, # Duplicate of CombiningClass.pl + Block => 0, # Suppress, as Blocks.txt is retained. +); + # Properties that this program ignores. my @unimplemented_properties = ( 'Unicode_Radical_Stroke' # Remove if changing to handle this one. 
@@ -880,7 +900,6 @@ my %why_obsolete; # Documentation only 'Decomposition_Mapping' => 'Accessible via Unicode::Normalize', 'ISO_Comment' => 'Apparently no demand for it, but can access it through Unicode::UCD::charinfo. Obsoleted, and code points for it removed in Unicode 5.2', - 'Unicode_1_Name' => "$simple, and no apparent demand for it, but can access it through Unicode::UCD::charinfo. If there is no later name for a code point, then this one is used instead in ch ... [9 chars truncated] 'Simple_Case_Folding' => "$simple. Can access this through Unicode::UCD::casefold", 'Simple_Lowercase_Mapping' => "$simple. Can access this through Unicode::UCD::charinfo", @@ -1072,10 +1091,6 @@ my $DEVELOPMENT_ONLY=<<"EOF"; EOF -# Enum values for to_output_map() method in the Map_Table package. -my $EXTERNAL_MAP = 1; -my $INTERNAL_MAP = 2; - my $LAST_UNICODE_CODEPOINT_STRING = "10FFFF"; my $LAST_UNICODE_CODEPOINT = hex $LAST_UNICODE_CODEPOINT_STRING; my $MAX_UNICODE_CODEPOINTS = $LAST_UNICODE_CODEPOINT + 1; @@ -5580,6 +5595,8 @@ sub trace { return main::trace(@_); } return $to_output_map{$addr} if defined $to_output_map{$addr}; my $full_name = $self->full_name; + return $global_to_output_map{$full_name} + if defined $global_to_output_map{$full_name}; # If table says to output, do so; if says to suppress it, do so. return $EXTERNAL_MAP if grep { $_ eq $full_name } @output_mapped_properties; @@ -10371,105 +10388,147 @@ sub filter_arabic_shaping_line { return; } -sub setup_special_casing { - # SpecialCasing.txt contains the non-simple case change mappings. The - # simple ones are in UnicodeData.txt, which should already have been read - # in to the full property data structures, so as to initialize these with - # the simple ones. Then the SpecialCasing.txt entries overwrite the ones - # which have different full mappings. 
- - # This routine sees if the simple mappings are to be output, and if so, - # copies what has already been put into the full mapping tables, while - # they still contain only the simple mappings. - - # The reason it is done this way is that the simple mappings are probably - # not going to be output, so it saves work to initialize the full tables - # with the simple mappings, and then overwrite those relatively few - # entries in them that have different full mappings, and thus skip the - # simple mapping tables altogether. - - my $file= shift; - Carp::carp_extra_args(\@_) if main::DEBUG && @_; +{ # Closure + my $lc; # Table for lowercase mapping + my $tc; + my $uc; + + sub setup_special_casing { + # SpecialCasing.txt contains the non-simple case change mappings. The + # simple ones are in UnicodeData.txt, which should already have been + # read in to the full property data structures, so as to initialize + # these with the simple ones. Then the SpecialCasing.txt entries + # overwrite the ones which have different full mappings. + + # This routine sees if the simple mappings are to be output, and if + # so, copies what has already been put into the full mapping tables, + # while they still contain only the simple mappings. + + # The reason it is done this way is that the simple mappings are + # probably not going to be output, so it saves work to initialize the + # full tables with the simple mappings, and then overwrite those + # relatively few entries in them that have different full mappings, + # and thus skip the simple mapping tables altogether. + + my $file= shift; + Carp::carp_extra_args(\@_) if main::DEBUG && @_; - # For each of the case change mappings... - foreach my $case ('lc', 'tc', 'uc') { - my $full = property_ref($case); - unless (defined $full && ! $full->is_empty) { - Carp::my_carp_bug("Need to process UnicodeData before SpecialCasing. 
Only special casing will be generated."); + $lc = property_ref('lc'); + $tc = property_ref('tc'); + $uc = property_ref('uc'); + + # For each of the case change mappings... + foreach my $case_table ($lc, $tc, $uc) { + my $case = $case_table->name; + my $full = property_ref($case); + unless (defined $full && ! $full->is_empty) { + Carp::my_carp_bug("Need to process UnicodeData before SpecialCasing. Only special casing will be generated."); + } + + # The simple version's name in each mapping merely has an 's' in + # front of the full one's + my $simple = property_ref('s' . $case); + $simple->initialize($full) if $simple->to_output_map(); + + my $simple_only = Property->new("_s$case", + Type => $STRING, + Default_Map => $CODE_POINT, + Perl_Extension => 1, + Description => "The simple mappings for $case for code points that have full mappings as well"); + $simple_only->set_to_output_map($INTERNAL_MAP); + $simple_only->add_comment(join_lines( <<END +This file is for UCD.pm so that it can construct simple mappings that would +otherwise be lost because they are overridden by full mappings. +END + )); } - # The simple version's name in each mapping merely has an 's' in front - # of the full one's - my $simple = property_ref('s' . $case); - $simple->initialize($full) if $simple->to_output_map(); + return; } - return; -} + sub filter_special_casing_line { + # Change the format of $_ from SpecialCasing.txt into something that + # the generic handler understands. Each input line contains three + # case mappings. This will generate three lines to pass to the + # generic handler for each of those. -sub filter_special_casing_line { - # Change the format of $_ from SpecialCasing.txt into something that the - # generic handler understands. Each input line contains three case - # mappings. This will generate three lines to pass to the generic handler - # for each of those. 
-
-    # The input syntax (after stripping comments and trailing white space is
-    # like one of the following (with the final two being entries that we
-    # ignore):
-    #   00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
-    #   03A3; 03C2; 03A3; 03A3; Final_Sigma;
-    #   0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
-    # Note the trailing semi-colon, unlike many of the input files. That
-    # means that there will be an extra null field generated by the split
+    # The input syntax (after stripping comments and trailing white space
+    # is like one of the following (with the final two being entries that
+    # we ignore):
+    #   00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
+    #   03A3; 03C2; 03A3; 03A3; Final_Sigma;
+    #   0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
+    # Note the trailing semi-colon, unlike many of the input files. That
+    # means that there will be an extra null field generated by the split
 
-    my $file = shift;
-    Carp::carp_extra_args(\@_) if main::DEBUG && @_;
+    my $file = shift;
+    Carp::carp_extra_args(\@_) if main::DEBUG && @_;
 
-    my @fields = split /\s*;\s*/, $_, -1; # -1 => retain trailing null fields
+    my @fields = split /\s*;\s*/, $_, -1; # -1 => retain trailing null
+                                          # fields
+
+    # field #4 is when this mapping is conditional. If any of these get
+    # implemented, it would be by hard-coding in the casing functions in
+    # the Perl core, not through tables. But if there is a new condition
+    # we don't know about, output a warning. We know about all the
+    # conditions through 6.0
+    if ($fields[4] ne "") {
+        my @conditions = split ' ', $fields[4];
+        if ($conditions[0] ne 'tr'  # We know that these languages have
+                                    # conditions, and some are multiple
+            && $conditions[0] ne 'az'
+            && $conditions[0] ne 'lt'
+
+               # And, we know about a single condition Final_Sigma, but
+               # nothing else.
+            && ($v_version gt v5.2.0
+                && (@conditions > 1 || $conditions[0] ne 'Final_Sigma')))
+        {
+            $file->carp_bad_line("Unknown condition '$fields[4]'. You should inspect it and either add code to handle it, or add to list of those that are to ignore");
+        }
+        elsif ($conditions[0] ne 'Final_Sigma') {
-    # field #4 is when this mapping is conditional. If any of these get
-    # implemented, it would be by hard-coding in the casing functions in the
-    # Perl core, not through tables. But if there is a new condition we don't
-    # know about, output a warning. We know about all the conditions through
-    # 6.0
-    if ($fields[4] ne "") {
-        my @conditions = split ' ', $fields[4];
-        if ($conditions[0] ne 'tr'  # We know that these languages have
-                                    # conditions, and some are multiple
-            && $conditions[0] ne 'az'
-            && $conditions[0] ne 'lt'
-
-            # And, we know about a single condition Final_Sigma, but
-            # nothing else.
-            && ($v_version gt v5.2.0
-                && (@conditions > 1 || $conditions[0] ne 'Final_Sigma')))
-        {
-            $file->carp_bad_line("Unknown condition '$fields[4]'. You should inspect it and either add code to handle it, or add to list of those that are to ignore");
+            # Don't print out a message for Final_Sigma, because we
+            # have hard-coded handling for it. (But the standard
+            # could change what the rule should be, but it wouldn't
+            # show up here anyway.
+
+            print "# SKIPPING Special Casing: $_\n"
+                                        if $verbosity >= $VERBOSE;
+        }
+        $_ = "";
+        return;
+    }
+    elsif (@fields > 6 || (@fields == 6 && $fields[5] ne "" )) {
+        $file->carp_bad_line('Extra fields');
+        $_ = "";
+        return;
     }
-    elsif ($conditions[0] ne 'Final_Sigma') {
-        # Don't print out a message for Final_Sigma, because we have
-        # hard-coded handling for it. (But the standard could change
-        # what the rule should be, but it wouldn't show up here
-        # anyway.
+    $_ = "$fields[0]; lc; $fields[1]";
+    $file->insert_adjusted_lines("$fields[0]; tc; $fields[2]");
+    $file->insert_adjusted_lines("$fields[0]; uc; $fields[3]");
-        print "# SKIPPING Special Casing: $_\n"
-                                        if $verbosity >= $VERBOSE;
+    # Copy any simple case change to the special tables constructed if
+    # being overridden by a multi-character case change.
+    if ($fields[1] ne $fields[0]
+        && (my $value = $lc->value_of(hex $fields[0])) ne $CODE_POINT)
+    {
+        $file->insert_adjusted_lines("$fields[0]; _slc; $value");
     }
-    $_ = "";
-    return;
-    }
-    elsif (@fields > 6 || (@fields == 6 && $fields[5] ne "" )) {
-        $file->carp_bad_line('Extra fields');
-        $_ = "";
+    if ($fields[2] ne $fields[0]
+        && (my $value = $tc->value_of(hex $fields[0])) ne $CODE_POINT)
+    {
+        $file->insert_adjusted_lines("$fields[0]; _stc; $value");
+    }
+    if ($fields[3] ne $fields[0]
+        && (my $value = $uc->value_of(hex $fields[0])) ne $CODE_POINT)
+    {
+        $file->insert_adjusted_lines("$fields[0]; _suc; $value");
+    }
+    return;
     }
-
-    $_ = "$fields[0]; lc; $fields[1]";
-    $file->insert_adjusted_lines("$fields[0]; tc; $fields[2]");
-    $file->insert_adjusted_lines("$fields[0]; uc; $fields[3]");
-
-    return;
 }
 
 sub filter_old_style_case_folding {
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index bd2070a..93ad8ee 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -148,6 +148,51 @@ C<Test::Simple> has been upgraded from version 0.97_01 to 0.98
 
 C<Tie::RefHash> has been upgraded from version 1.38 to 1.39.
 
+=item *
+
+Unicode::UCD has been upgraded from version 0.31 to 0.32.
+This includes a number of bug fixes:
+
+=over 4
+
+=item charinfo()
+
+=over 4
+
+=item *
+
+It is now updated to Unicode Version 6 with Corrigendum #8, except,
+as with Perl 5.14, the code point at U+1F514 has no name.
+
+=item *
+
+The Hangul syllable code points have the correct names, and their
+decompositions are always output without requiring L<Lingua::KO::Hangul::Util>
+to be installed.
+
+=item *
+
+The CJK (Chinese-Japanese-Korean) code points U+2A700 - U+2B734
+and U+2B740 - 2B81D are now properly handled.
+
+=item *
+
+The numeric values are now output for those CJK code points that have them.
+
+=item *
+
+The names that are output for code points with multiple aliases are now the
+corrected ones.
+
+=back
+
+=item charscript()
+
+This now correctly returns "Unknown" instead of C<undef> for the script
+of a code point that hasn't been assigned another one.
+
+=back
+
 =back
 
 =head2 Removed Modules and Pragmata

--
Perl5 Master Repository
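[Editor's note: the following demonstration is not part of the commits above. It is a sketch of the behavior the perldelta entries describe, assuming Unicode::UCD 0.32 or later (as shipped with Perl 5.14). The code points used are illustrative choices: U+0378 is an unassigned code point, U+2FE0 falls in a gap not covered by any block in Blocks.txt, and U+2A700 is the first code point of CJK Unified Ideographs Extension C.]

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Unicode::UCD qw(charinfo charscript charblock);

    # An unassigned code point: charscript() now returns the Unicode
    # value 'Unknown' instead of undef.
    print "U+0378 script: ", charscript(0x0378), "\n";   # Unknown

    # A code point that no block covers: charblock() now returns the
    # Unicode pseudo-block 'No_Block' instead of undef.
    print "U+2FE0 block: ", charblock(0x2FE0), "\n";     # No_Block

    # A CJK Extension C code point (U+2A700..U+2B734), one of the ranges
    # charinfo() now handles by examining the data files directly.
    my $info = charinfo(0x2A700);
    print "U+2A700 name: ", $info->{name}, "\n";  # CJK UNIFIED IDEOGRAPH-2A700

Before these fixes, the first two calls returned undef, so callers needed explicit defined() checks that the new returns make unnecessary.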
