On 6/4/2014 6:14 AM, Steve Dower wrote:
I agree with Daniel. Directly indexing into text suggests an attempted optimization that is likely to be incorrect for a set of strings. Splitting, regex, concatenation and formatting are really the main operations that matter, and MicroPython can optimize its implementations of these easily enough even with O(N) indexing.

Cheers,
Steve

Top-posted from my Windows Phone
------------------------------------------------------------------------
From: Daniel Holth <mailto:dho...@gmail.com>
Sent: 6/4/2014 5:17
To: Paul Sokolovsky <mailto:pmis...@gmail.com>
Cc: python-dev <mailto:python-dev@python.org>
Subject: Re: [Python-Dev] Internal representation of strings and Micropython

If we're voting, I think representing Unicode internally in MicroPython
as UTF-8 with O(N) indexing is a great idea, partly because I'm not
sure indexing into strings is a good idea in the first place: lots of
Unicode codepoints don't make sense by themselves; see also grapheme
clusters. It would probably work great.

I think native UTF-8 support is the most promising route for MicroPython Unicode support.

It would be an interesting proof of concept to implement an alternative CPython with PEP 393 replaced by UTF-8 internally... doing conversions for APIs that require a different encoding, but always maintaining and computing with the UTF-8 representation.

1) The first proof-of-concept implementation should implement codepoint indexing as an O(N) operation, searching from the beginning of the string for the Nth codepoint (see the sketch after this list).

Other proof-of-concept implementations could add a codepoint boundary cache; there are a variety of possible caching algorithms:

2) (Least space efficient) An array that can be indexed by codepoint position to yield the corresponding byte position; see full_index in the sketch after this list. (This would use more space than a UTF-32 representation!)

3) (Most space efficient) One cached entry that records the last codepoint/byte position referenced. UTF-8 can be traversed in either direction, so "next/previous" codepoint access would be relatively fast, and such accesses are very common operations, even when indexing notation is used: "for ix in range(len(str_x)): func(str_x[ix])". (This scheme is also covered by the sketch after the list.)

4) (Fixed-size caches) N entries: one for the last codepoint of the string, and the others at Codepoint_Length/N intervals. N could be tunable. (A sketch of the fixed-interval part of this scheme follows the closing paragraph below.)

5) (Fixed-size caches) Like 4, plus an extra entry like 3.

6) (Variable-size caches) Like 2, but only indexing every Nth codepoint. N could be tunable.

7) (Variable-size caches) Like 6, plus an extra entry like 3.

8) (Content-specific variable-size caches) Index each codepoint that is a different byte size than the previous codepoint, allowing direct arithmetic indexing to be used within each interval. The worst-case size is like 2; the best case is a single entry for the end of the string, when all codepoints are represented by the same number of bytes.

9) (Content-specific variable-size caches) Like 8, except cache entries could indicate fixed- or variable-size characters in the next interval, with a scheme like 4 or 6 used to prevent one interval from covering the whole string.
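
To make schemes 1, 2, and 3 concrete, here is a minimal pure-Python sketch. The names (Utf8Str, full_index) and all details are hypothetical: this models the idea, it is not MicroPython's actual implementation. With the cache pinned at (0, 0) the class behaves as scheme 1's plain O(N) scan; updating the cache on each access gives scheme 3:

class Utf8Str:
    """Hypothetical model: UTF-8 bytes indexed by codepoint position."""

    def __init__(self, s):
        self._buf = s.encode("utf-8")
        self._len = len(s)       # codepoint count, computed once at creation
        self._cache = (0, 0)     # scheme 3: last (codepoint index, byte offset)

    def __len__(self):
        return self._len

    @staticmethod
    def _is_continuation(byte):
        return byte & 0xC0 == 0x80   # UTF-8 continuation bytes are 0b10xxxxxx

    def _byte_offset(self, index):
        # Scan from the cached position; pinned at (0, 0) this degenerates
        # to scheme 1's forward scan from the start of the string.
        ci, bi = self._cache
        buf = self._buf
        while ci < index:            # forward, one codepoint at a time
            bi += 1
            while bi < len(buf) and self._is_continuation(buf[bi]):
                bi += 1
            ci += 1
        while ci > index:            # backward: UTF-8 is self-synchronizing
            bi -= 1
            while self._is_continuation(buf[bi]):
                bi -= 1
            ci -= 1
        self._cache = (ci, bi)
        return bi

    def __getitem__(self, index):
        if index < 0:
            index += self._len       # support negative indexing
        if not 0 <= index < self._len:
            raise IndexError("string index out of range")
        start = self._byte_offset(index)
        end = start + 1
        while end < len(self._buf) and self._is_continuation(self._buf[end]):
            end += 1
        return self._buf[start:end].decode("utf-8")

def full_index(buf):
    """Scheme 2 sketch: byte offsets of every codepoint in UTF-8 bytes buf
    (codepoint starts are exactly the non-continuation bytes)."""
    return [i for i, b in enumerate(buf) if b & 0xC0 != 0x80]

s = Utf8Str("héllo wörld")
assert len(s) == 11 and s[7] == "ö"

Note that in the "for ix in range(len(str_x))" loop above, each access moves the scheme 3 cache by exactly one codepoint, so the whole loop is O(N) overall even though a single random access is still O(N).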

Other hybrid schemes may present themselves as useful once experience is gained with some of these. It might be surprising how few workloads need more than scheme 3 to get reasonable performance.
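
And to make the fixed-interval part of scheme 4 concrete, a sketch along the same lines (again hypothetical; the end-of-string entry is omitted for brevity, and N=8 is just an arbitrary tunable default):

class CheckpointedUtf8Str(Utf8Str):
    """Scheme 4 sketch: byte offsets of every (Codepoint_Length/N)th codepoint."""

    def __init__(self, s, n_checkpoints=8):
        super().__init__(s)
        # For strings shorter than N, stride 1 degenerates to scheme 2's table.
        self._stride = max(1, self._len // n_checkpoints)
        self._checkpoints = []   # byte offset of every stride-th codepoint
        ci = bi = 0
        buf = self._buf
        while ci < self._len:
            if ci % self._stride == 0:
                self._checkpoints.append(bi)
            bi += 1
            while bi < len(buf) and self._is_continuation(buf[bi]):
                bi += 1
            ci += 1

    def _byte_offset(self, index):
        # Jump to the nearest preceding checkpoint, then scan forward;
        # at most stride - 1 codepoints are traversed per lookup.
        slot = index // self._stride
        self._cache = (slot * self._stride, self._checkpoints[slot])
        return super()._byte_offset(index)

Scheme 5 would keep the scheme 3 last-access entry alongside the table instead of overwriting it, so sequential loops stay cheap while random access stays bounded by the stride.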

Glenn