My apologies for hammering on this, but I think it is quite important and
currently Python 3.0 seems confused about UCS-2 versus UTF-16.
-On [20080702 20:47], Guido van Rossum ([EMAIL PROTECTED]) wrote:
No, Python already is aware of surrogates. I meant applications
processing non-BMP text should
Hi,
Subsequently doing a: print a[1] to get the 0x942a (鐪) actually requires
a[2] on the 2-byte Python 3.0.
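A minimal sketch of the indexing difference described above, assuming a hypothetical string made of one non-BMP character (U+20000, chosen only for illustration) followed by U+942A. Current CPython (3.3 and later) indexes strings by code point, so the sketch emulates the 2-byte build by unpacking the UTF-16 code units explicitly. The helper name utf16_code_units is not from the thread.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# Hypothetical sample: one supplementary-plane character, then U+942A.
a = "\U00020000\u942a"
units = utf16_code_units(a)

# Indexed by code unit (as a 2-byte build does), U+942A sits at index 2,
# because the first character occupies two units (a surrogate pair).
print([hex(u) for u in units])
print(hex(units[2]))
```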
How is it annoying *in practice*? In actual code the index, instead of
being a constant, will be retrieved through various means such as .find()
or re.search().start()... as you show
Jeroen Ruigrok van der Werven wrote:
The documentation for len() says:
Return the length (the number of items) of an object.
So what this tells us is that in a UCS-2 build of Python, the items in
a unicode string are not, strictly speaking, Unicode code points or
characters. Instead, they
I think the discussion is going in the wrong direction:
The choice between UCS2 and UCS4 builds is really only meant
to enhance the possibility to interface to native OS or
application APIs, e.g. Windows LIBC and Java use UTF-16, glibc
on Unix uses UCS4.
The problem of slicing Unicode objects
On Thu, Jul 3, 2008 at 5:39 AM, Nick Coghlan [EMAIL PROTECTED] wrote:
1. If you are advocating disallowing the use of characters outside the BMP
in a UCS-2 build, enumerate the advantages of doing so (paying particular
attention to any advantages which cannot be obtained simply by using an
-On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
Unicode is full of combining code points - if you break such a sequence,
the output will be just as wrong; regardless of UCS2 vs. UCS4.
In my opinion you are confusing two related, but very separate things here.
Combining characters
For programmers who want to target a 2-byte format (for win32
compatibility, for example)
As MAL said, this is taking the discussion in the wrong direction.
For people on Windows, win32 isn't a compatibility consideration. I
suspect most users of the other platforms MAL mentioned and all
On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
-On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
Unicode is full of combining code points - if you break such a sequence,
the output will be just as wrong; regardless of UCS2 vs. UCS4.
In my opinion you are confusing two
On Thu, Jul 3, 2008 at 3:48 AM, Jeroen Ruigrok van der Werven
[EMAIL PROTECTED] wrote:
My apologies for hammering on this, but I think it is quite important and
currently Python 3.0 seems confused about UCS-2 versus UTF-16.
[...]
You seem to be suggesting that len(u"\U00012345") should return 1
On Thu, Jul 3, 2008 at 6:42 AM, Mark Hammond [EMAIL PROTECTED] wrote:
For people on Windows, win32 isn't a compatibility consideration. I
suspect most users of the other platforms MAL mentioned and all others with
their own native unicode implementations would agree.
I'm sorry, but you're
-On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote:
You seem to be suggesting that len(u"\U00012345") should return 1 on
a system that internally uses UTF-16 and hence represents this string
as a surrogate pair.
From a Unicode and UTF-16 point of view that makes the most sense. So
On Thu, Jul 3, 2008 at 7:46 AM, Jeroen Ruigrok van der Werven
[EMAIL PROTECTED] wrote:
-On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote:
You seem to be suggesting that len(u"\U00012345") should return 1 on
a system that internally uses UTF-16 and hence represents this string
Hello,
2008/7/3 Guido van Rossum [EMAIL PROTECTED]:
I don't see an answer there to the question of whether the length()
method of a Java String object containing a single surrogate pair
returns 1 or 2; I suspect it returns 2. Python 3 supports things like
chr(0x12345) and ord("\U00012345").
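As a rough illustration of the two possible answers to the length question, the sketch below counts the same one-character string both ways. It assumes a current CPython (3.3 or later), where len() on a str counts code points. The UTF-16 code unit count is computed by encoding, which is what a 2-byte build's len() would report. Both helper names are invented for this example.

```python
def code_point_count(text):
    # Assumption: CPython 3.3+, where str is indexed by code point.
    return len(text)

def utf16_unit_count(text):
    # Each UTF-16 code unit is two bytes; no BOM is emitted for "utf-16-le".
    return len(text.encode("utf-16-le")) // 2

s = chr(0x12345)  # a single supplementary-plane character
print(ord(s) == 0x12345)
print(code_point_count(s), utf16_unit_count(s))
```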
On 03/07/2008, Guido van Rossum [EMAIL PROTECTED] wrote:
I don't see an answer there to the question of whether the length()
method of a Java String object containing a single surrogate pair
returns 1 or 2; I suspect it returns 2.
It appears you're right:
type testucs.java
class testucs {
Guido van Rossum guido at python.org writes:
The one thing that may be missing from Python is things like
interpretation of surrogates by functions like isalpha() and I'm okay
with adding that (since those have to loop over the entire string
anyway).
That and methods to safely iterate and
Paul Moore wrote:
On 03/07/2008, Guido van Rossum [EMAIL PROTECTED] wrote:
I don't see an answer there to the question of whether the length()
method of a Java String object containing a single surrogate pair
returns 1 or 2; I suspect it returns 2.
It appears you're right:
type testucs.java
-On [20080703 17:32], Paul Moore ([EMAIL PROTECTED]) wrote:
System.out.println(s.length());
I think you want to use codePointCount() to count the Unicode code points.
length() returns Unicode code units.
As http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html explains
On Thu, Jul 3, 2008 at 9:35 AM, Steve Holden [EMAIL PROTECTED] wrote:
Paul Moore wrote:
On 03/07/2008, Guido van Rossum [EMAIL PROTECTED] wrote:
I don't see an answer there to the question of whether the length()
method of a Java String object containing a single surrogate pair
returns 1 or
-On [20080703 17:03], Guido van Rossum ([EMAIL PROTECTED]) wrote:
I don't see an answer there to the question of whether the length()
method of a Java String object containing a single surrogate pair
returns 1 or 2; I suspect it returns 2.
As
http://java.sun.com/j2se/1.5.0/docs/api/java/lang
-On [20080703 18:45], James Y Knight ([EMAIL PROTECTED]) wrote:
I think this is misguided.
Only trying to at least correct the current situation, which I consider a
bit of a mess, personally. (Although it seems others share my view.)
I'd like to have 3 levels of access available:
1) byte-level
On Jul 3, 2008, at 10:46 AM, Jeroen Ruigrok van der Werven wrote:
-On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote:
You seem to be suggesting that len(u"\U00012345") should return 1 on
a system that internally uses UTF-16 and hence represents this string
as a surrogate pair
On Thu, Jul 3, 2008 at 10:01 AM, Jeroen Ruigrok van der Werven
[EMAIL PROTECTED] wrote:
What would the chances for inclusion in Python be if such a PEP + code were
presented, Guido?
As long as it is clear that the len() function and the basic slicing
and indexing operations on strings
(sorry for the crossposting)
Do you know what happened with http://us.pycon.org/?
Thank you!
--
. Facundo
Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/
___
Python-Dev mailing list
Python-Dev@python.org
On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
-On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
Unicode is full of combining code points - if you break such a sequence,
the output will be just
Basically everything but string forming or string printing seems to be
broken for surrogate pairs, from what I can tell.
We probably disagree about what "it works correctly" means. I think everything
works correctly.
Also, I think you are confused about slicing in the middle of a surrogate
pair,
On Thu, Jul 3, 2008 at 13:12, Facundo Batista [EMAIL PROTECTED] wrote:
(sorry for the crossposting)
Do you know what happened with http://us.pycon.org/?
Not sure. The machine is still up (it serves www.pycon.org as well).
Either something is misconfigured, or a process can't start, or
1. System is NOT memory limited (i.e. most desktops): use a UCS-4 Python
build, which is what most Linux distributions do (I'm not sure about the
pydotorg provided Windows or Mac OS X builds).
The Windows builds must continue to use a two-byte representation, as
otherwise PythonWin will break
-On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote:
On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
Please remember that lone surrogate pair code points are perfectly
valid Unicode code points, nevertheless. Just as a lone combining
code point is valid on its own
Please remember that lone surrogate pair code points are perfectly
valid Unicode code points, nevertheless. Just as a lone combining
code point is valid on its own.
Actually, I think they aren't (not any more than an invalid codepoint,
or an unassigned codepoint). They are reserved for UTF-16
I think you want to use codePointCount() to count the Unicode code points.
length() returns Unicode code units.
As http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html explains:
In the J2SE API documentation, Unicode code point is used for character
values in the range between
Surely it's desirable under all circumstances that
len(u) == sum(1 for c in u)
and that
[c for c in u] == [u[i] for i in range(len(u))]
How would that play under Jeroen's proposed change?
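The two invariants quoted above can be checked mechanically. The sketch below is only an illustration: the sample string is hypothetical, and the code-unit view is emulated with a list of integers because a current CPython str is indexed by code point.

```python
def invariants_hold(seq):
    """Check that len(), iteration and indexing agree on what an item is."""
    same_length = len(seq) == sum(1 for _ in seq)
    same_items = [c for c in seq] == [seq[i] for i in range(len(seq))]
    return same_length and same_items

u = "a\U00012345b"  # hypothetical sample with one non-BMP character

# Emulate a 2-byte build: one list item per UTF-16 code unit.
data = u.encode("utf-16-le")
code_units = [int.from_bytes(data[i:i + 2], "little")
              for i in range(0, len(data), 2)]

print(invariants_hold(u), len(u))
print(invariants_hold(code_units), len(code_units))
```

Both views pass because, within each view, len(), iteration and indexing all use the same item. A design in which len() counted code points while indexing stayed on code units would presumably break the second invariant.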
Yes, but I think the argument is about what c is -- a character or a
codepoint. Your
Daniel Arbuckle wrote:
Regardless, as I said before, nothing justifies silently changing the
meaning of a program based on an option that most users don't set for
themselves and are not aware of.
The premise of this thread seems to be that the majority should suffer
for the benefit of a
On Thu, Jul 3, 2008 at 10:44 AM, Terry Reedy [EMAIL PROTECTED] wrote:
The premise of this thread seems to be that the majority should suffer for
the benefit of a few. That is not Python's philosophy.
Who are the many here? Who are the few? I'd venture that (at least for
the foreseeable future,
In Montana visiting. Will be back at the hotel in about 4 hours. Looks
like base site include is missing or has wrong permissions.
On 7/3/08, David Goodger [EMAIL PROTECTED] wrote:
On Thu, Jul 3, 2008 at 13:12, Facundo Batista [EMAIL PROTECTED]
wrote:
(sorry for the crossposting)
Do you know
-On [20080703 19:31], Martin v. Löwis ([EMAIL PROTECTED]) wrote:
Yes, but it is two code units. Python's UTF-16 implementation operates
on code units, not code points.
Thank you, that is the single most important piece of information I got
about this entire thing because it does change the entire
On Thu, Jul 3, 2008 at 13:32, David Goodger [EMAIL PROTECTED] wrote:
On Thu, Jul 3, 2008 at 13:12, Facundo Batista [EMAIL PROTECTED] wrote:
(sorry for the crossposting)
Do you know what happened with http://us.pycon.org/?
Not sure. The machine is still up (it serves www.pycon.org as well).
On Thu, Jul 3, 2008 at 11:35 AM, Jeroen Ruigrok van der Werven
[EMAIL PROTECTED] wrote:
-On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote:
On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
Please remember that lone surrogate pair code points are perfectly
valid
On 2008-07-03 19:21, Adam Olsen wrote:
On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
-On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
Unicode is full of combining code points - if you break
On 2008-07-03 19:35, Jeroen Ruigrok van der Werven wrote:
-On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote:
On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
Please remember that lone surrogate pair code points are perfectly
valid Unicode code points
2008/7/3 David Goodger [EMAIL PROTECTED]:
Jeff fixed it. URL rewriting was off by mistake.
Thanks! :)
--
. Facundo
Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/
On 2008-07-03 19:44, Terry Reedy wrote:
The premise of this thread seems to be that the majority should suffer
for the benefit of a few. That is not Python's philosophy.
In reality, most Unixes ship with UCS4 builds of Python. Windows
and Mac OS X ship with UCS2 builds. Still, anyone is free
I've grabbed the latest libffi that contains support for the ARM processor.
I then enable FFI_CLOSURES in the arm/ffi.c file.
When I do this, I get compilation errors that it is missing
ffi_prep_closure.
Is ffi.c up to date for supporting the ARM platform?
Not sure if there is a simple
M.-A. Lemburg wrote:
On 2008-07-03 19:44, Terry Reedy wrote:
The premise of this thread seems to be that the majority should suffer
for the benefit of a few. That is not Python's philosophy.
In reality, most Unixes ship with UCS4 builds of Python. Windows
and Mac OS X ship with UCS2 builds.
Guido van Rossum wrote:
On Thu, Jul 3, 2008 at 10:44 AM, Terry Reedy [EMAIL PROTECTED] wrote:
The premise of this thread seems to be that the majority should suffer for
the benefit of a few. That is not Python's philosophy.
The premise is the OP's idea that Python should switch to all UCS4
Thanks for any help.
This list (python-dev) is not for getting help, but for providing it.
So if you have patches that you would like to discuss, please go
ahead. As you are seeking help, please use [EMAIL PROTECTED]
(aka news:comp.lang.python) instead.
Regards,
Martin
On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy [EMAIL PROTECTED] wrote:
The premise is the OP's idea that Python should switch to all UCS4 to create
a more pure ('ideal') situation or the idea that len(s) should count
codepoints (correct term?) for all builds as a matter of purity even though
on
On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen [EMAIL PROTECTED] wrote:
On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy [EMAIL PROTECTED] wrote:
The premise is the OP's idea that Python should switch to all UCS4 to create
a more pure ('ideal') situation or the idea that len(s) should count
codepoints
Wrong term - code units and code points are equivalent in UTF-16 and
UTF-32. What you're looking for is unicode scalar values.
How so? Section 2.5, UTF-16, says:
code points in the supplementary planes, in the range
U+10000..U+10FFFF, are represented as pairs of 16-bit code units.
So clearly,
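For reference, the mapping between a supplementary-plane code point and its pair of 16-bit code units is plain arithmetic. The following is a sketch of the standard UTF-16 calculation, not code from the thread, and the function names are invented here.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x12345)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```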
On Thu, Jul 3, 2008 at 4:21 PM, Guido van Rossum [EMAIL PROTECTED] wrote:
On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen [EMAIL PROTECTED] wrote:
On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy [EMAIL PROTECTED] wrote:
The premise is the OP's idea that Python should switch to all UCS4 to create
a
On Thu, Jul 3, 2008 at 4:50 PM, Adam Olsen [EMAIL PROTECTED] wrote:
Clearly, each surrogate is a valid code point, regardless of encoding.
A surrogate pair simultaneously represents both one code point (the
scalar value) and two code points (the surrogate code points). To be
unambiguous you