> My first question was how expensive python compares are vs C compares. And
> since python 2 has PyString_AS_STRING, which just gives you a char* pointer
> to a C string, I went in and replaced PyObject_RichCompareBool with strcmp
> and did a simple benchmark. And I was just totally blown away; it turns out
> you get something like a 40-50% improvement (at least on my simple
> benchmark).
> So that was the motivation for all this. Actually, if I wrote this for
> python 2, I might be able to get even better numbers (at least for strings),
> since we can't use strcmp in python 3. (Actually, I've heard UTF-8 strings
> are strcmp-able, so maybe if we go through and verify all the strings are
> UTF-8 we can strcmp them? I don't know enough about how PyUnicode stuff
> works to do this safely).

I'm not sure what you mean by "strcmp-able"; do you mean that the
lexical ordering of two Unicode strings is guaranteed to be the same
as the byte-wise ordering of their UTF-8 encodings? I don't think
that's true, but then, I'm not entirely sure how Python currently
sorts strings. Without knowing which language the text represents,
it's not possible to sort perfectly.

Problems are still common when the algorithm has to
encompass more than one language. For example, in German dictionaries
the word ökonomisch comes between offenbar and olfaktorisch, while
Turkish dictionaries treat o and ö as different letters, placing oyun
before öbür.

Which means these lists would already be considered sorted in their
respective languages, yet sorted() reorders both of them:

rosuav@sikorsky:~$ python3
Python 3.7.0a0 (default:a78446a65b1d+, Sep 29 2016, 02:01:55)
[GCC 6.1.1 20160802] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> sorted(["offenbar", "ökonomisch", "olfaktorisch"])
['offenbar', 'olfaktorisch', 'ökonomisch']
>>> sorted(["oyun", "öbür", "parıldıyor"])
['oyun', 'parıldıyor', 'öbür']

So what's Python doing? Is it a codepoint ordering?
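One quick way to poke at that question, as a rough sketch rather than a
proof: sort the same words three ways and see whether the results agree.
The names (words, by_default, by_codepoint, by_utf8) are just mine for
illustration, and I'm assuming "codepoint ordering" means comparing the
sequences of ord() values element by element.

```python
# Words taken from the two examples above.
words = ["offenbar", "ökonomisch", "olfaktorisch",
         "oyun", "öbür", "parıldıyor"]

# Default str comparison.
by_default = sorted(words)

# Explicit code point comparison: each word becomes a list of ints.
by_codepoint = sorted(words, key=lambda s: [ord(c) for c in s])

# Byte-wise comparison of the UTF-8 encodings (bytes compare like memcmp).
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))

print(by_default == by_codepoint)
print(by_default == by_utf8)

# The code points that explain why ö lands after both o and p.
print(hex(ord("o")), hex(ord("p")), hex(ord("ö")))
```

If all three orderings match, that is at least consistent with the
default sort being a plain code point ordering, and with byte-wise
comparison of UTF-8 giving the same answer. Neither of those is the
dictionary order a German or Turkish reader would expect.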

