Author: chromatic
Date: Tue Apr 1 19:01:13 2008
New Revision: 26698

Modified:
   trunk/docs/pdds/draft/pdd28_character_sets.pod

Log:
[PDD] Typo fixes and minor formatting nits.

Modified: trunk/docs/pdds/draft/pdd28_character_sets.pod
==============================================================================
--- trunk/docs/pdds/draft/pdd28_character_sets.pod (original)
+++ trunk/docs/pdds/draft/pdd28_character_sets.pod Tue Apr 1 19:01:13 2008
@@ -29,7 +29,7 @@
 The Unicode Standard prefers the concepts of I<character repertoire> (a
 collection of characters) and I<character code> (a mapping which tells you what
 number represents which character in the repertoire). Character set is commonly
-used to mean the standard which defines both a repertoire and a code. 
+used to mean the standard which defines both a repertoire and a code.
 
 =head2 Codepoint
 
@@ -38,7 +38,7 @@
 
 =head2 Encoding
 
-An encoding determines how a codepoint is represented inside a computer. 
+An encoding determines how a codepoint is represented inside a computer.
 Simple encodings like ASCII define that the codepoints 0-127 simply live as
 their numeric equivalents inside an eight-bit bytes. Other fixed-width
 encodings like UTF-16 use more bytes to encode more
@@ -65,9 +65,9 @@
 etc), including any modifiers (diacritics, etc).
 
 The Unicode Standard defines a I<grapheme cluster> (commonly simplified to just
-I<graheme>) as one or more characters forming a visible whole when displayed,
+I<grapheme>) as one or more characters forming a visible whole when displayed,
 in other words, a bundle of a character and all of its combining characters.
-Since graphemes are the highest-level abstract idea of a "character", they're
+Because graphemes are the highest-level abstract idea of a "character", they're
 useful for converting between character sets.
 
 =head2 Normalization Form
@@ -98,7 +98,7 @@
 
 =item *
 
-Parrot provides an interface for interacting with strings and converting 
+Parrot provides an interface for interacting with strings and converting
 between character sets and encodings.
 
 =item *
@@ -130,7 +130,7 @@
 string encodings inside Parrot. (Producers of Parrot strings can do whatever
 is most efficient for them.) To put it in simple terms: if you find yourself
 writing C<*s++> or any other C string idioms, you need to stop and think if
-that's what you really mean. Not everything is byte-based any more.
+that's what you really mean. Not everything is byte-based anymore.
 
 =head2 Grapheme Normalization Form
 
@@ -147,7 +147,7 @@
 
 String operations on this kind of variable-byte encoding can be complex and
 expensive. Operations like comparison and traversal require a series of
-computations and lookaheads, since any given grapheme may be a sequence of
+computations and lookaheads, because any given grapheme may be a sequence of
 combining characters. The Unicode Standard defines several "normalization
 forms" that help with this problem. Normalization Form C (NFC), for example,
 decomposes everything, then re-composes as much as possible. So if you see the
@@ -161,8 +161,8 @@
 means that even in the most normalized Unicode form, string manipulation code
 must always assume a variable-byte encoding, and use expensive lookaheads. The
 cost is incurred on every operation, though the particular string operated on
-might not contain combining characters. It's particularly noticable in parsing
-and regular expression matches, where backtracking operations may retraverse
+might not contain combining characters. It's particularly noticeable in parsing
+and regular expression matches, where backtracking operations may re-traverse
 the characters of a simple string hundreds of times.
 
 In order to reduce the cost of variable-byte operations and simplify some
@@ -243,22 +243,22 @@
      push @grapheme_table, "\x{438}\x{30F}";
      ~ $#grapheme_table;
    });
-   push @string, $codepoint;
+   push @string, $codepoint;
 
 =head2 String API
 
 Strings have the following structure:
 
   struct parrot_string_t {
-      UnionVal cache;
-      Parrot_UInt flags;
-      char *strstart;
-      UINTVAL bufused;
-      UINTVAL strlen;
-      const struct _encoding *encoding;
-      const struct _charset *charset;
+      UnionVal cache;
+      Parrot_UInt flags;
+      UINTVAL bufused;
+      UINTVAL hashval;
+      UINTVAL strlen;
+      char *strstart;
+      const struct _encoding *encoding;
+      const struct _charset *charset;
       const struct _normalization *normalization;
-      UINTVAL hashval;
   };
 
 Deprecation note: the enum C<parrot_string_representation_t> will be removed.
@@ -270,7 +270,7 @@
 
 Conversion will be done with a function called C<string_grapheme_copy>:
 
-    INTVAL string_grapheme_copy(STRING* src, STRING* dst)
+    INTVAL string_grapheme_copy(STRING *src, STRING *dst)
 
 Converting a string from one format to another involves creating a new empty
 string with the required attributes, and passing the source string and the new