Hi all,

I just wanted to mention that several other language implementors have faced the same problem of dealing with "UTF-16" that contains lone surrogate code points and representing it in "UTF-8", and they have arrived at essentially the same solution. Users include the Racket, Scheme 48, and Rust languages (all three only for I/O on Windows) and the Servo browser engine (for the sake of JavaScript). Recently, Simon Sapin of Mozilla spec'd this trick in exhaustive detail, christening it WTF-8:

  https://simonsapin.github.io/wtf-8/

While everything described there may be fairly obvious to those immersed in the guts of Unicode, I wanted to raise awareness that the technique has a name and other users.
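To make the trick concrete, here is a minimal sketch of the generalized encoder (my own illustration in Python 3 syntax, not code from the spec): it follows the ordinary UTF-8 bit layout but skips the strict-UTF-8 check that rejects surrogates, so a lone surrogate such as U+D805 encodes to the three bytes ED A0 85. This is essentially what CPython 3's 'surrogatepass' error handler does.

    def wtf8_encode_code_point(cp):
        # Plain UTF-8 bit layout, minus the strict-UTF-8 rule that
        # rejects surrogate code points (U+D800..U+DFFF).
        if cp < 0x80:
            return bytes([cp])
        elif cp < 0x800:
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            # This branch also accepts lone surrogates -- the whole point.
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])

    >>> wtf8_encode_code_point(0xD805)
    b'\xed\xa0\x85'
    >>> '\ud805'.encode('utf-8', 'surrogatepass')   # same result in CPython 3
    b'\xed\xa0\x85'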
Cheers,
Robin

On 8 March 2016 at 16:16, Armin Rigo <ar...@tunes.org> wrote:
> Hi hubo,
>
> On 7 March 2016 at 13:49, hubo <h...@jiedaibao.com> wrote:
> > I think in Python 3.x, u'\ud805\udc09' is not another form of
> > u'\U00011409'; it is just an illegal unicode string. It also raises
> > UnicodeEncodeError if you try to encode it into UTF-8. The problem
> > is that it is legal to define and use these strings. If PyPy uses
> > UTF-8 or UTF-16 as the internal storage format, I don't think it is
> > possible to keep these details the same as CPython, but it should
> > be acceptable.
>
> We're good at keeping obscure details the same as CPython. It's only
> a matter of adding the correct checks on top of the encode() and
> decode() methods, independently of the underlying representation.
>
> In this case, because we can construct the length-1 unicode string
> u'\ud805', we have to represent it internally somehow, and the
> natural way is to represent it as the 3 bytes '\xed\xa0\x85'. So for
> u'\ud805\udc09' we use 6 bytes. Strictly speaking, we are thus not
> using utf-8 internally, but "utf-8-without-extra-consistency-checks".
> In Python 2, u'\ud805\udc09'.encode('utf-8') returns
> '\xf0\x91\x90\x89', i.e. a single code point encoded as 4 bytes.
> This means that ``encode('utf-8')`` has to check for surrogates and
> do something more complicated on Python 2.x (or complain on Python
> 3.x). In other words, neither ``decode('utf-8')`` nor
> ``encode('utf-8')`` can be a no-op. Decoding and encoding need to
> check the data, and might actually need to make a copy in corner
> cases, but not in the vast majority of cases.
>
> This is all focused on the web's (and generally Linux's) approach of
> "utf-8 everywhere". For Windows, the story is more complicated.
> CPython 2.x uses UTF-16, like the Windows API. However, recent
> CPython 3.x moved instead to a variable-width representation whose
> widest form is UCS-4 (== UTF-32). If you are on a recent CPython
> 3.x, build a unicode object with a large code point, and then call
> the Windows API with it, CPython needs to convert it to UTF-16
> dynamically anyway, as far as I can tell, i.e. convert from UCS-4 to
> UTF-16. In the proposal discussed here, it would instead convert
> from utf-8-without-extra-consistency-checks to UTF-16 in that
> situation.
>
> There are definitely trade-offs to explore, but I doubt we can fully
> explore them without actually trying it out.
>
>
> A bientôt,
>
> Armin.
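P.S. For the archives: the '\xf0\x91\x90\x89' in Armin's Python 2 example above is plain surrogate-pair arithmetic. Here is a rough sketch (my illustration, not actual PyPy code) of the join that such an encode() check would perform:

    def join_surrogate_pair(high, low):
        # Combine a high (lead) and low (trail) surrogate into the
        # supplementary code point they stand for in UTF-16.
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    >>> hex(join_surrogate_pair(0xD805, 0xDC09))
    '0x11409'
    >>> chr(0x11409).encode('utf-8')   # the 4-byte sequence Armin mentions
    b'\xf0\x91\x90\x89'

Scanning for such pairs is exactly the extra pass that keeps encode('utf-8') from being a no-op over the internal bytes.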
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev