On Wed, Jun 04, 2014 at 03:32:25PM +0000, Steve Dower wrote:
> Steven D'Aprano wrote:
> > The language semantics says that a string is an array of code points. Every
> > index relates to a single code point, no code point extends over two or more
> > indexes.
> > There's a 1:1 relationship between code points and indexes. How is direct
> > indexing "likely to be incorrect"?
>
> We're discussing the behaviour under a different (hypothetical) design
> decision than a 1:1 relationship between code points and indexes, so
> arguing from that stance doesn't make much sense.

I'm open to different implementations. I earlier even suggested that the
choice of O(1) indexing versus O(N) indexing was a quality of
implementation issue, not a make-or-break issue for whether something
can call itself Python (or even "99% compatible with Python"). But I
don't believe that exposing that implementation at the Python level is
valid: regardless of whether it is efficient or not, I should be able to
write code like this:

    a = [mystring[i] for i in range(len(mystring))]
    b = list(mystring)
    assert a == b

That is not the case if you expose the underlying byte-level
implementation at the Python level, and treat strings as an array of
*bytes*. Paul seems to want to do this, or at least he wants Python 4 to
do this. I think it is *completely* inappropriate to do so.

I *think* you may agree with me (correct me if I'm wrong), because you
go on to agree with me:

> > e.g.
> >
> > s = "---ÿ---"
> > offset = s.index('ÿ')
> > assert s[offset] == 'ÿ'
> >
> > That cannot fail with Python's semantics.
>
> Agreed, and it shouldn't but I'm not actually sure.
> (I was actually referring to the optimization
> being incorrect for the goal, not the language semantics). What you'd
> probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may
> be surprising, but is also correct.

You don't seem to be talking about sys.getsizeof, so I guess you're
talking about something at the C level (or other underlying
implementation), ignoring the object overhead. I don't know why you
think I'd find that surprising -- one cannot fit 0x10FFFF Unicode code
points in a single byte, so whether you use UTF-32, UTF-16, UTF-8,
Python 3.3's FSR or some other implementation, at least some code points
are going to use more than one byte.

> But what are you trying to achieve (why are you writing this code)?
> All this example really shows is that you're only using indexing for
> trivial purposes.

I'm trying to understand what point you are trying to make, because I'm
afraid I don't quite get it.

[...]
> If copying into a separate list is a problem (memory-wise),
> re.finditer('\\S+', string) also provides the same behaviour and gives
> me the sliced string, so there's no need to index for anything.

finditer returns a bunch of MatchObjects, which give you the indexes of
the found substring. Whether you do it yourself, or get the re module to
do it, you're indexing somewhere.

> The downside is that it isn't as easy to teach as the 1:1
> relationship, and currently it doesn't perform as well *in CPython*.
> But if MicroPython is focusing on size over speed, I don't see any
> reason why they shouldn't permit different performance characteristics
> and require a slightly different approach to highly-optimized coding.

I don't have a problem with different implementations, so long as that
implementation isn't exposed at the Python level with changes of
semantics such as breaking the promise that a string is an array of code
points, not of bytes.

> In any case, this is an interesting discussion with a genuine effect
> on the Python interpreter ecosystem. Jython and IronPython already
> have different string implementations from CPython - having official
> (and hopefully flexible) guidance on deviations from the reference
> implementation would I think help other implementations provide even
> more value, which is only a good thing for Python.

Yes, agreed.

-- 
Steven
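
For concreteness, a minimal sketch of the invariant discussed in this
message, run against CPython 3 str objects. The "byte-level view" half
is an assumption for illustration only: it uses UTF-8 via str.encode to
stand in for a hypothetical implementation that exposes its storage at
the Python level, and the thread does not say that any implementation
actually behaves this way. The sample string `text` is likewise made up.

```python
import re

s = "---ÿ---"

# Code-point semantics: indexing and iteration see the same items.
a = [s[i] for i in range(len(s))]
b = list(s)
assert a == b

offset = s.index("ÿ")
assert s[offset] == "ÿ"

# Hypothetical byte-level view (assumption: UTF-8 storage exposed directly).
raw = s.encode("utf-8")
needle = "ÿ".encode("utf-8")          # U+00FF is two bytes in UTF-8
byte_offset = raw.index(needle)

assert len(raw) == len(s) + 1         # one extra byte for the non-ASCII character
assert raw[byte_offset:byte_offset + 1] != needle  # a single byte is not the whole character

# re.finditer still reports indexes; slicing with them is indexing too.
text = "spam ÿeggs ham"
spans = [m.span() for m in re.finditer(r"\S+", text)]
words = [text[start:end] for start, end in spans]
assert words == text.split()
```

The byte offset happens to equal the code-point offset here only
because everything before 'ÿ' is ASCII; with non-ASCII text earlier in
the string the two would diverge.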