Please bear with me for a few paragraphs ;-) One aspect of str-type strings is the efficiency afforded when all the encoding really is ascii. If the internal encoding were instead fixed as, e.g., utf-16le, it would probably still be efficient enough on today's computers for most actual string purposes (excluding the current use of str-strings as raw byte sequences).
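To put a rough number on that (assuming pure-ascii text, and ignoring per-object overhead), a fixed utf-16le representation doubles the payload:

    >>> len('abcdef')                       # 8-bit str, ascii payload
    6
    >>> len(u'abcdef'.encode('utf-16-le'))  # the same text as utf-16le bytes
    12

Whether that factor still matters on today's machines is exactly the question.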
I.e., you'd still have to identify what was "strings" (of characters) and what was really byte sequences with no implied or explicit encoding or character semantics. Ok, let's make that distinction explicit: call one kind of string a byte sequence and the other a character sequence (representation being a separate issue). A unicode object is of course the prime _general_ representation of a character sequence in Python, but all the names in Python source code (the ones that become NAME tokens) are UIAM also character sequences, representable by byte sequences interpreted according to the ascii encoding.

For the sake of discussion, suppose we had another _character_ sequence type that was the moral equivalent of unicode except for internal representation, namely a str subclass with an encoding attribute specifying the encoding that you _could_ use to decode the str bytes part to get unicode (which you wouldn't do except when necessary). We could call it

    class charstr(str): ...

and have charstr().bytes be the str part and charstr().encoding specify the encoding part.

In all the contexts where we have obvious encoding information, we could then generate a charstr instead of a str. E.g., if the source of module_a has

    # -*- coding: latin1 -*-
    cs = 'über-cool'

then

    type(cs)        # => <type 'charstr'>
    cs.bytes        # => '\xfcber-cool'
    cs.encoding     # => 'latin-1'

and print cs would act like

    print cs.bytes.decode(cs.encoding)

-- or I guess

    sys.stdout.write(cs.bytes.decode(cs.encoding).encode(sys.stdout.encoding))

followed by

    sys.stdout.write('\n'.decode('ascii').encode(sys.stdout.encoding))

for the newline of the print.

Now if module_b has

    # -*- coding: utf8 -*-
    cs = 'über-cool'

and we interactively import module_a and module_b and then do

    print module_a.cs + ' =?= ' + module_b.cs

what could happen ideally vs. what we have currently? UIAM, currently we would just get the three str byte sequences concatenated to make '\xfcber-cool =?= \xc3\xbcber-cool', and that would be printed, without conversion, as whatever it happens to look like when the output is interpreted according to sys.stdout.encoding.

But if those cs instances had been charstr instances, the coding-cookie encoding information would have been preserved, and the interactive print could have evaluated the string expression -- given cs.decode() as sugar for

    cs.bytes.decode(cs.encoding
                    or globals().get('__encoding__')
                    or __import__('sys').getdefaultencoding())

-- as

    module_a.cs.decode() + ' =?= '.decode() + module_b.cs.decode()

if pairwise terms differ in encoding, as they all might here.

If the interactive session source were e.g. latin-1, like module_a, then module_a.cs + ' =?= ' would not require an encoding change, because the ' =?= ' would be a charstr instance with encoding == 'latin-1', and so the result would still be latin-1 that far. But with module_b.cs being utf8, the next addition would cause the .decode() promotions to unicode. In a console window, the ' =?= '.encoding might be 'cp437' or such, and the first addition would then already cause promotion (since module_a.cs.encoding != 'cp437').

I have sneaked in run-time access to individual modules' encodings by assuming that the encoding cookie could be compiled in as an explicit global __encoding__ variable for any given module (what to use as __encoding__ for built-in modules could vary for various purposes).
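To make that concrete, here is a very rough Python sketch of the kind of charstr I have in mind. It is purely illustrative: charstr, .bytes, .encoding and the __encoding__ global are the hypothetical names from above, a real implementation would presumably live in C, and it would have to cover far more operations than __add__:

    import sys

    class charstr(str):
        """Hypothetical str subclass that remembers the encoding of its bytes."""

        def __new__(cls, bytes_, encoding=None):
            self = str.__new__(cls, bytes_)
            # Fall back to a module-level __encoding__ (the compiled-in coding
            # cookie), then to the system default.  A real implementation would
            # look up __encoding__ in the *caller's* module, not this one.
            self.encoding = (encoding
                             or globals().get('__encoding__')
                             or sys.getdefaultencoding())
            return self

        @property
        def bytes(self):
            # the raw, undecoded str payload
            return str(self)

        def decode(self, encoding=None, errors='strict'):
            # sugar: decode using the carried encoding unless told otherwise
            return str.decode(self, encoding or self.encoding, errors)

        def __add__(self, other):
            if isinstance(other, charstr) and other.encoding == self.encoding:
                # same encoding: stay in the compact byte representation
                return charstr(str(self) + str(other), self.encoding)
            if isinstance(other, (charstr, unicode)):
                # encodings differ: promote both sides to unicode
                if not isinstance(other, unicode):
                    other = other.decode()
                return self.decode() + other
            if isinstance(other, str):
                # plain str with no encoding info: assume ours (debatable)
                return charstr(str(self) + other, self.encoding)
            return NotImplemented

With something like that, module_a.cs + ' =?= ' would stay latin-1 bytes (whether the literal is a plain str or a latin-1 charstr), while adding module_b's utf8 charstr would promote the whole result to unicode -- which is the behavior sketched above.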
ISTM this could be useful in situations where an encoding assumption is necessary and 'ascii' is currently not as good a guess as one could make, though I suspect that if string literals became charstr strings instead of str strings, many if not most of those situations would disappear (I'm saying this because ATM I can't think of an 'ascii'-guess situation that wouldn't go away ;-)

If there were a charchr() version of chr() that produced a charstr instead of a str, IWT one would want an easy-sugar default encoding assumption, probably the same one you'd assume for '%c' % num in a given module's source -- which presumably would be '%c'.encoding, where '%c' takes the encoding of the module source, normally recorded in __encoding__. So charchr(n) would act like

    chr(n).decode().encode(''.encoding)

-- or more reasonably charstr(chr(n)), which would be short for

    charstr(chr(n), globals().get('__encoding__')
                    or __import__('sys').getdefaultencoding())

Or some efficient equivalent ;-)

Using strings in dicts requires hashing to find key-comparison candidates and comparison to check for key equivalence. This would seem to point to some kind of normalized hashing, but not necessarily normalized key representation. Some of that is apparently happening already, since

    >>> hash('a') == hash(unicode('a'))
    True

I don't know what would be worth the trouble to optimize string key usage where the strings really are all of one encoding, vs. totally general use, vs. a heavily biased mix -- or even whether it could be done without unreasonable complexity. Maybe a dict could be given an option to hash all its keys as unicode vs. whatever it does now. But having a charstr subtype of str would improve the "implicit" conversions to unicode IMO.

Anyway, I wanted to throw in my .02USD re the implicit conversions, taking the view that much of the implicitness could be based on reliable inferences from the source encodings of string literals or from their effects as format strings.

Regards,
Bengt Richter

[not a normal subscriber to python-dev, so I'll have to google for any responses]
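P.S. To make the charchr() idea concrete too, here is a rough sketch in the same spirit as the charstr sketch above (all names hypothetical; as before, a real version would look up __encoding__ in the calling module's globals rather than its own):

    import sys

    def charchr(n, encoding=None):
        # hypothetical chr() variant that returns a charstr tagged with the
        # source encoding instead of a bare str
        return charstr(chr(n),
                       encoding
                       or globals().get('__encoding__')
                       or sys.getdefaultencoding())

    # e.g. charchr(0xfc, 'latin-1') would be charstr('\xfc', 'latin-1'),
    # and charchr(0xfc, 'latin-1').decode() would give u'\xfc'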