Atsuo,
you are not really addressing my arguments in your reply.
My main concern is that repr(unicode) as well as '%r' is used
a lot in logging and debugging of applications.
In the 2.x series of Python, the output of repr() has traditionally
been plain ASCII: it does not require any special encoding
and doesn't run into problems when the output is mixed with
other encodings used in the log file, on the console, or wherever
the output of repr() is sent.
You are now suggesting breaking this convention by allowing
all printable code points in the repr() output.
Depending on where you send the repr() output and the contents
of the PyUnicode object, this will likely result in exceptions
in the .write() method of the stream object.
Just adjusting sys.stdout and sys.stderr to prevent them from
falling over is not enough (and is indeed not within the scope
of the PEP, since those changes are *major* and not warranted
for just getting your Unicode repr() to work). repr() is very
often written to log files and those would all have to be
changed as well.
Now, as I've said before, I can see your point about wanting
to be able to read the Unicode code points, even if you use
repr() - instead of the more straight-forward .encode()
approach. However, when suggesting such changes, you always
have to see the other side as well:
- Are there alternative ways to get the "problem" fixed ?
- Is the added convenience worth breaking existing conventions ?
- Is it worth breaking existing applications ?
I've suggested making the repr() output configurable to address
the convenience aspect of your proposal. You could then set the
output encoding to e.g. "unicode-printable" and get your preferred
output. The default could remain set to the current all-ASCII output.
Hardwiring the encoding is not a good idea, especially since there
are lots of alternatives for getting readable output from
PyUnicode objects today, without any changes to the interpreter.
E.g.

    print '%s' % u.encode('utf-8')

or

    print '%s' % u.encode('shift-jis')

or

    logfile = open('my.log', 'w', encoding='unicode-printable')
    logfile.write(u)

or

    def unicode_repr(u):
        return u.encode('unicode-printable')

    print '%s' % unicode_repr(u)
There are many ways to solve your problem.
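In the same spirit, a "unicode-printable"-like behavior can be approximated today with a custom encode error handler (a sketch only: the handler name `printable-fallback` is made up for this example, and Python 3's str.isprintable() stands in for the printability test discussed in this thread; the "unicode-printable" codec itself does not exist):

```python
import codecs

def printable_fallback(exc):
    # Hypothetical handler: characters the target codec cannot encode
    # are kept as UTF-8 bytes when printable, hex-escaped otherwise.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = []
    for ch in exc.object[exc.start:exc.end]:
        if ch.isprintable():
            out.append(ch.encode('utf-8'))
        else:
            out.append(('\\u%04x' % ord(ch)).encode('ascii'))
    return b''.join(out), exc.end

codecs.register_error('printable-fallback', printable_fallback)

# Printable non-ASCII text survives; a zero-width space gets escaped.
print('\u65e5\u672c\u8a9e\u200b'.encode('ascii', 'printable-fallback'))
```

Registering the handler once makes it usable with any encoding via the errors= argument of encode(), open() or TextIOWrapper.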
In summary, I am:
-1 on hardwiring the unicode repr() output to a non-ASCII
encoding
+1 on adding the PyUnicode_ISPRINTABLE() API
+1 on adding a unicode-printable codec which implements
your suggested encoding, so that you can use it for e.g.
log files or as sys.stdout encoding
+0 on making unicode repr() encoding adjustable
Regards,
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, May 14 2008)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
On 2008-05-09 19:23, Atsuo Ishimoto wrote:
On Fri, May 9, 2008 at 1:52 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
For sys.stdout this doesn't make sense at all, since it hides encoding
errors for all applications using sys.stdout as a piping mechanism.
-1 on that.
You can raise UnicodeEncodeError for encoding errors if you want, by
setting sys.stdout's error-handler to `strict`.
No, that's not a good idea. I don't want to change every single
affected application just to make sure that they don't write
corrupt data to stdout.
The changes you need to make to your applications will be so small
that I don't think this is a valid argument.
And the number of applications you need to change will be rather small.
What you call "corrupt data" is just hex-escaped characters of a
foreign language. In most cases, printing (or writing to a file) such
a string does no harm, so I think raising an exception by default is
overkill. Java doesn't raise an exception on encoding errors, but just
prints `?`. .NET languages such as C# also print '?'. Perl prints a
hex-escaped string, as proposed in this PEP.
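The two behaviors compared here can be reproduced with Python's existing error handlers (a minimal illustration: `replace` mimics the Java/C# '?' substitution, `backslashreplace` the hex-escaping proposed in the PEP):

```python
s = '\u65e5\u672c\u8a9e'  # "Japanese" written in Japanese

# Java/C#-style: unencodable characters become '?'
print(s.encode('ascii', errors='replace'))           # b'???'

# Perl/PEP-style: unencodable characters become hex escapes
print(s.encode('ascii', errors='backslashreplace'))  # b'\\u65e5\\u672c\\u8a9e'
```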
Even though this PEP was rejected,
You mean PEP 3138 was rejected ??
Er, I should have written "Even if this PEP was ...", perhaps.
Well, "annoying" is not good enough for such a big change :-)
So? For Perl, annoyance was reason enough to change the entire language :-)
The backslashreplace idea may have some merits in interactive
Python sessions or IDLE, but it hides encoding errors in all
other situations.
Encoding errors are not hidden, but are represented by hex-escaped
strings. We can get much more information about the string being
printed than from a traceback.
I'm not against changing the repr() of Unicode objects, but
please make sure that this change does not break debugging
Python applications, whether you're debugging an app using
'print' statements, piping repr() through a socket to a remote
debugger, or writing information to a log file. The important
factor to take into account is the other end that will receive
the data.
I think your request is too vague to be acted on. This proposal
improves the currently broken debugging for me, and I see no loss of
information for debugging. But the "other end" varies too much to say
anything definite about it.
BTW: One problem that your PEP doesn't address, which I mentioned
on the ticket:
By putting all printable chars into the repr() you lose the
ability to actually see the number of code points you have
in a Unicode string.
With the current repr(), I cannot get any information other than the
number of code points. That is not what I want to know when printing
repr(). For the length of the string, I'll just do print(len(s)).
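A small illustration of this trade-off (assuming Python 3 semantics): a printable repr() may render several code points as a single glyph, while len() still reports the exact count:

```python
s = 'e\u0301'  # 'e' followed by a combining acute accent

print(repr(s))  # may display as a single accented glyph in a capable terminal
print(len(s))   # 2 -- two code points, regardless of how it displays
```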
Please name the property Py_UNICODE_ISPRINTABLE. Py_UNICODE_ISHEXESCAPED
isn't all that intuitive.
The name `Py_UNICODE_ISPRINTABLE` came to my mind first, but I was
not sure that `printable` was an accurate word. I'm okay with
Py_UNICODE_ISPRINTABLE, but I'd like to hear opinions. If no one
objects to Py_UNICODE_ISPRINTABLE, I'll go for it.
How can things easily be changed so that it's possible to get the
Py2.x style hex escaping back into Py3k without having to change
all repr() calls and %r format markers for Unicode objects ?
I didn't intend to imply "without having to change". Perhaps
"migrate" was the wrong word and "port" would be better.
As for repr() and %r formats, they are unlikely to need changes in
most cases. They need to be changed only if pure ASCII is required
even when your locale is capable of printing the strings.
I can see your point with it being easier to read e.g. German,
Japanese or Korean data, but it still has to be possible to
use repr() for proper debugging which allows the user to
actually see what is stored in a Unicode object in terms of
code points.
You can see the code points easily; the function I wrote in the PEP to
convert such strings to a Python 2 style repr() is a good example. But I
believe ordinary use cases prefer readable strings over code points.
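For readers who do want the Python 2 style output, a conversion along the lines Ishimoto mentions can be sketched with the existing unicode_escape codec (an approximation, not the exact function from the PEP; it does not handle quotes inside the string):

```python
def py2_style_repr(s):
    # Approximate Python 2's all-ASCII repr() of a text string by
    # round-tripping through the unicode_escape codec.
    return "u'%s'" % s.encode('unicode_escape').decode('ascii')

print(py2_style_repr('\u65e5\u672c\u8a9e'))  # u'\u65e5\u672c\u8a9e'
```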
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000