Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

M.-A. Lemburg Mon, 24 Oct 2005 01:40:36 -0700

Neil Hodgson wrote:
> Guido van Rossum:
> 
> 
>>Folks, please focus on what Python 3000 should do.
>>
>>I'm thinking about making all character strings Unicode (possibly with
>>different internal representations a la NSString in Apple's Objective
>>C) and introduce a separate mutable bytes array data type. But I could
>>use some validation or feedback on this idea from actual
>>practitioners.
> 
> 
>    I'd like to more tightly define Unicode strings for Python 3000.
> Currently, Unicode strings may be implemented with either 2 byte
> (UCS-2) or 4 byte (UTF-32) elements. Python should allow strings to
> contain any Unicode character and should be indexable yielding
> characters rather than half characters. Therefore Python strings
> should appear to be UTF-32. There could still be multiple
> implementations (using UTF-16 or UTF-8) to preserve space but all
> implementations should appear to be the same apart from speed and
> memory use.


There seems to be a general misunderstanding here: even if you
have UCS4 storage, it is still possible to slice a Unicode
string in a way which makes rendering it correctly impossible.

Unicode has the concept of combining code points, e.g. you can
store an "é" (e with an acute accent) as "e" followed by the
combining acute accent (U+0301). Now if you slice off the accent,
you'll break the character that you encoded using combining code
points.
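
As a minimal sketch of the problem (written for a current Python 3,
where str is always indexed by code point; the variable names are
only illustrative), slicing between a base letter and its combining
accent silently changes the text:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as "é"
decomposed = "e\u0301"
# NFC normalization folds the pair into the single code point U+00E9
precomposed = unicodedata.normalize("NFC", decomposed)

word = "caf" + decomposed   # 5 code points, 4 visible characters
truncated = word[:4]        # cuts between the "e" and its accent

print(len(word))                        # 5
print(truncated == "cafe")              # True: the accent is lost
print(unicodedata.combining(word[4]))   # non-zero: a combining mark
```

The slice is perfectly valid at the code point level, which is why
wider storage alone does not solve the problem.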

Note that combining code points are rather common in encodings
of Asian scripts, so this is not an artificial example.

Some time ago I proposed a new module called unicodeindex
to help with indexing. It would solve most of the indexing
issues you run into when dealing with Unicode. I've attached
it to this email for reference.

More on the terms used:

http://www.egenix.com/files/python/EuroPython2002-Python-and-Unicode.pdf
http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf

-- 
Marc-Andre Lemburg
eGenix.com


PEP: 0XXX
Title: Unicode Indexing Helper Module
Version: $Revision: 1.0 $
Author: [EMAIL PROTECTED] (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History: 

Abstract

    This PEP proposes a new module "unicodeindex" which provides a
    means of indexing Unicode objects at various higher-level
    abstractions of "characters".

Problem and Terminology

    Unicode objects can be indexed just like string objects, using
    what in Unicode terms is called a code unit as the index basis.

    Code units are the storage entities used by the Unicode
    implementation to store a single Unicode information unit and do
    not necessarily map 1-1 to code points which are the smallest
    entities encoded by the Unicode standard. Python exposes code
    units to the programmer via the Unicode object indexing and slicing
    API, e.g. u[10] or u[12:15] refer to the code units at index 10
    and indices 12 to 14.
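
    The difference between code units and code points can be
    illustrated with a short sketch.  It assumes Python 3.3 or later,
    where str is indexed by code point, so the UTF-16 code units are
    obtained by encoding; the helper name utf16_code_units is
    hypothetical and not part of this proposal.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

# MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane
clef = "\U0001D11E"
units = utf16_code_units(clef)

print(len(clef))                 # 1 code point
print([hex(u) for u in units])   # ['0xd834', '0xdd1e'], a surrogate pair
```

    On an implementation that stores 16-bit code units, indexing this
    string would yield the two surrogates separately rather than the
    single code point.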

    Code points can sometimes be combined to form graphemes, which
    are then displayed by the Unicode output device as one character.
    A word is a sequence of characters separated by space characters
    or punctuation; a line is a sequence of code points separated by
    line-breaking code point sequences.

    For addressing Unicode, there are basically five different methods
    by which you can reference the data:

    1. per code unit    (codeunit)
    2. per code point   (codepoint)
    3. per grapheme     (grapheme)
    4. per word         (word)
    5. per line         (line)

    The indexing type name is given in parentheses and used in the
    module interface.
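
    To make the five levels concrete, the following rough sketch
    counts the elements of one sample string at each level.  It rests
    on several simplifying assumptions that are not part of this
    proposal: code units are taken to be UTF-16, a grapheme is any
    code point that is not a combining mark, words are split on
    whitespace only, and lines are whatever str.splitlines() yields.

```python
import unicodedata

# "Café" in decomposed form, a space, a non-BMP character,
# a line break, and a second line
sample = "Cafe\u0301 \U0001D11E\nok"

counts = {
    # UTF-16 uses two bytes per code unit
    "codeunit": len(sample.encode("utf-16-le")) // 2,
    # Python 3 str length is the number of code points
    "codepoint": len(sample),
    # Simplification: count every code point that is not a combining mark
    "grapheme": sum(1 for ch in sample if unicodedata.combining(ch) == 0),
    # Simplification: whitespace-separated words
    "word": len(sample.split()),
    "line": len(sample.splitlines()),
}

for name, value in counts.items():
    print(name, value)
```

    For this sample the counts are 11, 10, 9, 3 and 2, so the same
    index number refers to a different position at each level.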

Proposed Solution

    I propose to add a new module to the standard Python library which
    provides interfaces implementing the above indexing methods.

Module Interface

    The module should provide the following interfaces for all five
    indexing styles:

    next_<indextype>(u, index) -> integer

        Returns the Unicode object index for the start of the next
        <indextype> found after u[index] or -1 in case no next element
        of this type exists.

    prev_<indextype>(u, index) -> integer

        Returns the Unicode object index for the start of the previous
        <indextype> found before u[index] or -1 in case no previous
        element of this type exists.

    <indextype>_index(u, n) -> integer

        Returns the Unicode object index for the start of the n-th
        <indextype> element in u. Raises an IndexError in case no n-th
        element can be found.

    <indextype>_count(u, index) -> integer

        Counts the number of complete <indextype> elements found in
        u[:index] and returns the count as integer.

    <indextype>_start(u, index) -> integer

        Returns 1 or 0 depending on whether u[index] marks the start
        of an <indextype> element.

    <indextype>_end(u, index) -> integer

        Returns 1 or 0 depending on whether u[index] marks the end of
        an <indextype> element.

    <indextype>_slice(u, index) -> slice object or None

        Returns the slice pointing to the <indextype> element found in 
        u at the given index or None in case no such element can be found
        at that position.

    Symbols used in the above definitions:

       <indextype>   one of: codeunit, codepoint, grapheme, word, line
       u             is the Unicode object
       index         the Unicode object index, e.g. 10 in u[10]
       n             is an integer    

    Note that in Unicode terms, the Unicode object index refers to a
    code unit.
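
    As an illustration only, here is how a few of these interfaces
    might look for the grapheme index type.  The function names follow
    the naming scheme above, but the bodies are a sketch under stated
    assumptions: they target Python 3 (where the str index is a code
    point), and they treat a grapheme as a base code point plus any
    following combining marks, which is much simpler than the full
    segmentation rules of the Unicode standard.  prev_grapheme is
    interpreted as returning the nearest grapheme start before
    u[index].

```python
import unicodedata

def _is_start(u, i):
    # Simplifying assumption: any code point that is not a combining
    # mark starts a grapheme; index 0 always starts one.
    return i == 0 or unicodedata.combining(u[i]) == 0

def next_grapheme(u, index):
    """Index of the next grapheme start after u[index], or -1."""
    for i in range(index + 1, len(u)):
        if _is_start(u, i):
            return i
    return -1

def prev_grapheme(u, index):
    """Index of the nearest grapheme start before u[index], or -1."""
    for i in range(index - 1, -1, -1):
        if _is_start(u, i):
            return i
    return -1

def grapheme_count(u, index):
    """Number of complete graphemes in u[:index]."""
    count = sum(1 for i in range(min(index, len(u))) if _is_start(u, i))
    # If the cut falls inside a grapheme, the last one is incomplete
    if index < len(u) and not _is_start(u, index):
        count -= 1
    return count

def grapheme_slice(u, index):
    """Slice covering the grapheme at u[index], or None."""
    if not 0 <= index < len(u):
        return None
    start = index
    while not _is_start(u, start):
        start -= 1
    stop = next_grapheme(u, index)
    return slice(start, len(u) if stop == -1 else stop)

u = "e\u0301a\u0300b"            # "é", "à", "b" in decomposed form
print(next_grapheme(u, 0))       # 2
print(grapheme_count(u, 3))      # 1
print(u[grapheme_slice(u, 3)] == "a\u0300")   # True
```

    Slicing with the result of grapheme_slice keeps each base letter
    together with its accents, which plain index arithmetic does not.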

Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:
