Change 20939 by [EMAIL PROTECTED] on 2003/08/29 20:28:49
Integrate:
[ 20935]
Tiny doc tweak from Shannon -jj Behrens.
[ 20936]
Some perluniintro tweaks.
[ 20937]
Subject: [PATCH] Re: all 9007199254740992s are equal, but some are more equal
than others
From: Nicholas Clark <[EMAIL PROTECTED]>
Date: Wed, 27 Aug 2003 22:59:55 +0100
Message-ID: <[EMAIL PROTECTED]>
Affected files ...
... //depot/maint-5.8/perl/pod/perluniintro.pod#10 integrate
... //depot/maint-5.8/perl/pp.c#28 integrate
Differences ...
==== //depot/maint-5.8/perl/pod/perluniintro.pod#10 (text) ====
Index: perl/pod/perluniintro.pod
--- perl/pod/perluniintro.pod#9~20818~ Thu Aug 21 22:54:24 2003
+++ perl/pod/perluniintro.pod Fri Aug 29 13:28:49 2003
@@ -19,6 +19,7 @@
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
+Unicode 1.0 was released in October 1991, and 4.0 in April 2003.
A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
@@ -33,11 +34,10 @@
I<code points>.
The Unicode standard prefers using hexadecimal notation for the code
-points. If numbers like C<0x0041> are unfamiliar to
-you, take a peek at a later section, L</"Hexadecimal Notation">.
-The Unicode standard uses the notation C<U+0041 LATIN CAPITAL LETTER A>,
-to give the hexadecimal code point and the normative name of
-the character.
+points. If numbers like C<0x0041> are unfamiliar to you, take a peek
+at a later section, L</"Hexadecimal Notation">. The Unicode standard
+uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
+hexadecimal code point and the normative name of the character.
Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
@@ -86,12 +86,13 @@
A common myth about Unicode is that it would be "16-bit", that is,
Unicode is only represented as C<0x10000> (or 65536) characters from
-C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0, Unicode
-has been defined all the way up to 21 bits (C<0x10FFFF>), and since
-Unicode 3.1, characters have been defined beyond C<0xFFFF>. The first
-C<0x10000> characters are called the I<Plane 0>, or the I<Basic
-Multilingual Plane> (BMP). With Unicode 3.1, 17 planes in all are
-defined--but nowhere near full of defined characters, yet.
+C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0 (July
+1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
+and since Unicode 3.1 (March 2001), characters have been defined
+beyond C<0xFFFF>. The first C<0x10000> characters are called the
+I<Plane 0>, or the I<Basic Multilingual Plane> (BMP). With Unicode
+3.1, 17 (yes, seventeen) planes in all were defined--but they are
+nowhere near full of defined characters, yet.
Another myth is that the 256-character blocks have something to
do with languages--that each block would define the characters used
@@ -104,13 +105,14 @@
For further information see L<Unicode::UCD>.
The Unicode code points are just abstract numbers. To input and
-output these abstract numbers, the numbers must be I<encoded> somehow.
-Unicode defines several I<character encoding forms>, of which I<UTF-8>
-is perhaps the most popular. UTF-8 is a variable length encoding that
-encodes Unicode characters as 1 to 6 bytes (only 4 with the currently
-defined characters). Other encodings include UTF-16 and UTF-32 and their
-big- and little-endian variants (UTF-8 is byte-order independent)
-The ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding forms.
+output these abstract numbers, the numbers must be I<encoded> or
+I<serialised> somehow. Unicode defines several I<character encoding
+forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
+variable length encoding that encodes Unicode characters as 1 to 6
+bytes (only 4 with the currently defined characters). Other encodings
+include UTF-16 and UTF-32 and their big- and little-endian variants
+(UTF-8 is byte-order independent) The ISO/IEC 10646 defines the UCS-2
+and UCS-4 encoding forms.
For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.
@@ -645,8 +647,8 @@
$b = "\x{100}";
print "$a = $b\n";
-the output string will be UTF-8-encoded C<ab\x80c\x{100}\n>, but note
-that C<$a> will stay byte-encoded.
+the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
+C<$a> will stay byte-encoded.
Sometimes you might really need to know the byte length of a string
instead of the character length. For that use either the
@@ -752,7 +754,10 @@
How Does Unicode Work With Traditional Locales?
In Perl, not very well. Avoid using locales through the C<locale>
-pragma. Use only one or the other.
+pragma. Use only one or the other. But see L<perlrun> for the
+description of the C<-C> switch and its environment counterpart,
+C<$ENV{PERL_UNICODE}> to see how to enable various Unicode features,
+for example by using locale settings.
=back
@@ -876,7 +881,8 @@
=head1 SEE ALSO
L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
-L<perlretut>, L<Unicode::Collate>, L<Unicode::Normalize>, L<Unicode::UCD>
+L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
+L<Unicode::UCD>
=head1 ACKNOWLEDGMENTS
==== //depot/maint-5.8/perl/pp.c#28 (text) ====
Index: perl/pp.c
--- perl/pp.c#27~20468~ Sun Aug 3 22:16:18 2003
+++ perl/pp.c Fri Aug 29 13:28:49 2003
@@ -2766,28 +2766,6 @@
}
}
-/*
- * There are strange code-generation bugs caused on sparc64 by gcc-2.95.2.
- * These need to be revisited when a newer toolchain becomes available.
- */
-#if defined(__sparc64__) && defined(__GNUC__)
-# if __GNUC__ < 2 || (__GNUC__ == 2 && __GNUC_MINOR__ < 96)
-# undef SPARC64_MODF_WORKAROUND
-# define SPARC64_MODF_WORKAROUND 1
-# endif
-#endif
-
-#if defined(SPARC64_MODF_WORKAROUND)
-static NV
-sparc64_workaround_modf(NV theVal, NV *theIntRes)
-{
- NV res, ret;
- ret = Perl_modf(theVal, &res);
- *theIntRes = res;
- return ret;
-}
-#endif
-
PP(pp_int)
{
dSP; dTARGET; tryAMAGICun(int);
@@ -2811,34 +2789,15 @@
if (value < (NV)UV_MAX + 0.5) {
SETu(U_V(value));
} else {
-#if defined(SPARC64_MODF_WORKAROUND)
- (void)sparc64_workaround_modf(value, &value);
-#elif defined(HAS_MODFL_POW32_BUG)
-/* some versions of glibc split (i + d) into (i-1, d+1) for 2^32 <= i < 2^64 */
- NV offset = Perl_modf(value, &value);
- (void)Perl_modf(offset, &offset);
- value += offset;
-#else
- (void)Perl_modf(value, &value);
-#endif
- SETn(value);
+ SETn(Perl_floor(value));
}
}
else {
if (value > (NV)IV_MIN - 0.5) {
SETi(I_V(value));
} else {
-#if defined(SPARC64_MODF_WORKAROUND)
- (void)sparc64_workaround_modf(-value, &value);
-#elif defined(HAS_MODFL_POW32_BUG)
-/* some versions of glibc split (i + d) into (i-1, d+1) for 2^32 <= i < 2^64 */
- NV offset = Perl_modf(-value, &value);
- (void)Perl_modf(offset, &offset);
- value += offset;
-#else
- (void)Perl_modf(-value, &value);
-#endif
- SETn(-value);
+ /* This is maint, and we don't have Perl_ceil in perl.h */
+ SETn(-Perl_floor(-value));
}
}
}
End of Patch.