Author: allison
Date: Tue Apr  1 16:19:29 2008
New Revision: 26697

Modified:
   trunk/docs/pdds/draft/pdd28_character_sets.pod

Log:
[pdd] A few more clarifications to the Strings PDD, while responding to mailing
list comments.


Modified: trunk/docs/pdds/draft/pdd28_character_sets.pod
==============================================================================
--- trunk/docs/pdds/draft/pdd28_character_sets.pod      (original)
+++ trunk/docs/pdds/draft/pdd28_character_sets.pod      Tue Apr  1 16:19:29 2008
@@ -26,11 +26,10 @@
 
 =head2 Character Set
 
-The Unicode Standard has deprecated the term character set, preferring the
-concepts of I<character repertoire> (a collection of characters) and
-I<character code> (a mapping which tells you what number represents which
-character in the repertoire). We still use it, though, to mean the standard
-which defines both a repertoire and a code. 
+The Unicode Standard prefers the concepts of I<character repertoire> (a
+collection of characters) and I<character code> (a mapping which tells you what
+number represents which character in the repertoire). Character set is commonly
+used to mean the standard which defines both a repertoire and a code. 
 
 =head2 Codepoint
 
@@ -65,12 +64,11 @@
 number, punctuation mark, kanji, hiragana, Arabic glyph, Devanagari symbol,
 etc), including any modifiers (diacritics, etc).
 
-We've adopted the term grapheme to refer to one or more characters forming a
-visible whole when displayed, in other words, a bundle of a character and all
-of its combining characters. Parrot must support languages which manipulate
-strings grapheme-by-grapheme, and since graphemes are the highest-level
-interpretation of a "character", they're useful for converting between
-character sets.
+The Unicode Standard defines a I<grapheme cluster> (commonly simplified to just
+I<grapheme>) as one or more characters forming a visible whole when displayed,
+in other words, a bundle of a character and all of its combining characters.
+Since graphemes are the highest-level abstract idea of a "character", they're
+useful for converting between character sets.
 
 =head2 Normalization Form
 
@@ -106,7 +104,7 @@
 =item *
 
 Operations that require understanding the semantics of a string must respect
-the character set (character repertoire and character code) of the string.
+the character set of the string.
 
 =item *
 
@@ -124,8 +122,9 @@
 
 Parrot was designed from the outset to support multiple string formats:
 multiple character sets and multiple encodings. We don't standardize on Unicode
-internally, because for the majority of use cases, it's still far more
-efficient to deal with whatever input data the user sends us.
+internally, converting all strings to Unicode strings, because for the majority
+of use cases it's still far more efficient to deal with whatever input data the
+user sends us.
 
 Consumers of Parrot strings need to be aware that there is a plurality of
 string encodings inside Parrot. (Producers of Parrot strings can do whatever is
@@ -294,6 +293,9 @@
 http://www.unicode.org/reports/tr15/ - The Unicode Consortium's
 explanation of different normalization forms.
 
+http://unicode.org/reports/tr29/ - "grapheme clusters" in the Unicode Standard
+Annex
+
 "Unicode: A Primer", Tony Graham - Arguably the most readable book on
 how Unicode works.
 
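
As an illustration of the grapheme paragraph in the diff above ("a bundle of a
character and all of its combining characters"), here is a minimal Python
sketch. It is not Parrot code and not the full grapheme cluster algorithm from
the Unicode annex referenced above; the helper name simple_graphemes is
hypothetical, and the grouping rule (attach every character with a nonzero
canonical combining class to the preceding character) is a simplifying
assumption that ignores cases such as Hangul syllables and spacing marks.

```python
import unicodedata


def simple_graphemes(text):
    """Group each character with the combining marks that follow it.

    Simplified approximation: a character whose canonical combining
    class is nonzero is attached to the previous cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # combining mark: extend the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


# "eleve" with an acute accent on the first e and a grave accent on the
# second, written in decomposed form (base letter + combining mark).
decomposed = "e\u0301le\u0300ve"

codepoint_count = len(decomposed)
clusters = simple_graphemes(decomposed)
composed = unicodedata.normalize("NFC", decomposed)
```

The string holds seven codepoints but only five clusters, which is the
distinction the PDD draws between codepoints and graphemes. Normalizing to NFC
folds each base-plus-mark pair into a single precomposed codepoint, so in that
form the codepoint count and the cluster count agree.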
