On Fri, Mar 18, 2016 at 10:46 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> On Fri, 18 Mar 2016 06:00 pm, Ian Kelly wrote:
>
>> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
>> <rantingrickjohn...@gmail.com> wrote:
>>> In the event that i change my mind about Unicode, and/or for
>>> the sake of others, who may want to know, please provide a
>>> list of languages that *YOU* think handle Unicode better than
>>> Python, starting with the best first. Thanks.
>
> Better than Python? Easy-peasy:
>
> List of languages with Unicode handling which is better than Python = []
>
> I'm not aware of any language with better or more complete Unicode
> functionality than Python's. (That doesn't necessarily mean that they
> don't exist.)
And this also doesn't preclude languages that have *as good* handling
as Python's, of which I know of one off-hand, and there may be any
number. (Trivial case: Take Python 3.5, change the definition of a
block to be { } instead of indentation, and release it as Bracethon
1.0. Voila, a distinct-yet-related language whose Unicode handling is
exactly as good as Python's.)

>> jmf has been asked this before, and as I recall he seems to feel that
>> UTF-8 should be used for all purposes, ignoring the limitations of
>> that encoding such as that indexing becomes a O(n) operation.
>
> Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
> instance, your UTF-8 string might consist of an array of bytes containing
> the string, plus an array of indexes to the start of each code point. For
> example, the string:
>
> “abcπßЊ•𒀁”
>
> (including the quote marks) is 10 code points in length and 22 bytes as
> UTF-8. Grouping the (hex) bytes for each code point, we have:
>
> e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d
>
> so we could get a O(1) UTF-8 string by recording the bytes (in hex) plus
> the indexes (in decimal) in which each code point starts:
>
> e2809c616263cf80c39fd08ae280a2f0928081e2809d
>
> 0 3 4 5 6 8 10 12 15 19
>
> but (assuming each index needs 2 bytes, which supports strings up to 65535
> characters in length), that's actually LESS memory efficient than UTF-32:
> 42 bytes versus 40.

A lot of strings will have no more than 255 non-ASCII characters in
them. (For example, all strings with no more than 255 total
characters.) You could store, instead of the indexes themselves, a
series of one-byte offsets:

e2809c616263cf80c39fd08ae280a2f0928081e2809d
0 2 2 2 2 3 4 5 7 10

Locating a byte based on its character position is still O(1); you
look up that position in the offset table, add that to your original
character position, and you have the byte location (a rough Python
sketch of this appears below). For strings with too many non-ASCII
codepoints, you'd need some other representation, but at that point,
it might be worth just switching to UTF-32.

Of course, O(1) isn't the ultimate goal to the exclusion of all else.
For a simple sequential parser, indexing might be such a rare
operation that it's okay for it to be O(N), as you're never going to
index more than a few characters from a known position. Or if you're
trying to search a few gig of text, it's entirely possible that
transcoding into an indexable format is a complete waste of time, and
it's better to just work with a stream of bytes straight off the
disk. But for a general string type in a high level language, I'm
normally going to assume that indexing is fairly cheap.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
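As a concrete illustration of the one-byte offset table described above,
here is a minimal Python sketch. It is not from the original thread: the
class name OffsetIndexedUTF8 and its methods are invented for
illustration, and it handles only strings whose offsets fit in a single
byte, exactly the case ChrisA describes.

# Hypothetical sketch (not from the thread) of the one-byte-offset scheme:
# store the UTF-8 bytes plus, for each character, the difference between
# its byte position and its character position.

class OffsetIndexedUTF8:
    """UTF-8 string with O(1) code point indexing via a per-character
    one-byte offset table (offset = byte position - character position)."""

    def __init__(self, text):
        self.data = text.encode('utf-8')
        offsets = []
        byte_pos = 0
        for char_pos, ch in enumerate(text):
            offset = byte_pos - char_pos
            if offset > 255:
                # Too many non-ASCII bytes for one-byte offsets; a real
                # implementation might fall back to UTF-32 here.
                raise ValueError("offsets no longer fit in one byte")
            offsets.append(offset)
            byte_pos += len(ch.encode('utf-8'))
        self.offsets = bytes(offsets)  # one extra byte per character
        self.length = len(text)

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # O(1) lookup: byte position = character position + stored offset.
        if not 0 <= index < self.length:
            raise IndexError("string index out of range")
        start = index + self.offsets[index]
        if index + 1 < self.length:
            end = index + 1 + self.offsets[index + 1]
        else:
            end = len(self.data)
        return self.data[start:end].decode('utf-8')

s = OffsetIndexedUTF8('\u201cabc\u03c0\u00df\u040a\u2022\U00012001\u201d')
print(len(s))   # 10
print(s[4])     # π (bytes cf80)
print(s[8])     # the cuneiform sign (bytes f0928081)

The offset table costs one byte per character, so for the ten-character
example above that is 22 UTF-8 bytes plus 10 offset bytes, i.e. 32 bytes
against UTF-32's 40, while keeping indexing O(1).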