* Steven D'Aprano:
On Fri, 29 Jan 2010 07:10:01 +0100, Alf P. Steinbach wrote:

   >>> L = ["æ", "ø", "å"]   # This is in SORTED ORDER in Norwegian L

[...]

   >>> L.sort( key = locale.strxfrm )
   >>> L
   ['å', 'æ', 'ø']
   >>> locale.strcoll( "å", "æ" )
   1
   >>> locale.strcoll( "æ", "ø" )
   -1

Note that strcoll correctly orders the strings as ["æ", "ø", "å"], that
is, it would have if it could have been used as cmp function to sort (or
better, to a separate routine named e.g. custom_sort).

This is in Python2.5, so I'm explicitly specifying unicode strings:

L = [u"æ", u"ø", u"å"]
assert sorted(L) == [u'å', u'æ', u'ø']

The default C-locale sorting of L does not equal to L. Now let's change to Norwegian locale:

import locale
locale.setlocale(locale.LC_ALL, 'nb_NO')
'nb_NO'
print u''.join(sorted(L, cmp=locale.strcoll))
æøå

So far so good, we've confirmed that in Python 2.5 we can sort in a locale-aware form. Now, can we do this with a key function? Thanks to Raymond Hettinger's recipe here:

http://code.activestate.com/recipes/576653/


print u''.join(sorted(L, key=CmpToKey(locale.strcoll)))
æøå


Success!

Let's try it in Python 3.1:

L = ["æ", "ø", "å"]
assert sorted(L) == ['å', 'æ', 'ø']

import locale
locale.setlocale(locale.LC_ALL, 'nb_NO')
'nb_NO'


Hm. A bit off-topic, but...

  >>> import locale
  >>> locale.setlocale(locale.LC_ALL, 'nb_NO')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "C:\Program Files\cpython\python31\lib\locale.py", line 527, in
      return _setlocale(category, locale)
  locale.Error: unsupported locale setting
  >>> _

This on a machine where the Python default locale is Norwegian.


''.join(sorted(L, key=CmpToKey(locale.strcoll)))
'æøå'


The definition of CmpToKey can be found at the URL above. It's not very exciting, but here it is:


def CmpToKey(mycmp):
    'Convert a cmp= function into a key= function'
    class K(object):
        def __init__(self, obj, *args):
            self.obj = obj
        def __lt__(self, other):
            return mycmp(self.obj, other.obj) == -1
        def __gt__(self, other):
            return mycmp(self.obj, other.obj) == 1
        def __eq__(self, other):
            return mycmp(self.obj, other.obj) == 0
        def __le__(self, other):
return mycmp(self.obj, other.obj) != 1 def __ge__(self, other):
            return mycmp(self.obj, other.obj) != -1
        def __ne__(self, other):
            return mycmp(self.obj, other.obj) != 0
    return K

This is pretty smart as a generic solution.

Thanks! :-)

I was thinking more of sticking those comparisions in some custom string class or such, which would be rather ugly...

The above does have quite high overhead though!

That is, it's /inefficient/ compared to using a 'cmp' function directly.


If that's too verbose for you, stick this as a helper function in your application:


def CmpToKey(mycmp):
    'Convert a cmp= function into a key= function'
    class K(object):
        def __init__(self, obj, *args):
            self.obj = obj
        __lt__ = lambda s, o: mycmp(s.obj, o.obj) == -1
        __gt__ = lambda s, o: mycmp(s.obj, o.obj) == 1
        __eq__ = lambda s, o: mycmp(s.obj, o.obj) == 0
        __le__ = lambda s, o: mycmp(s.obj, o.obj) != 1
        __ge__ = lambda s, o: mycmp(s.obj, o.obj) != -1
        __ne__ = lambda s, o: mycmp(s.obj, o.obj) != 0
    return K


[...]
The above may just be a bug in the 3.x stxfrm. But it illustrates that
sometimes you have your sort order defined by a comparision function.
Transforming that into a key can be practically impossible (it can also
be quite inefficient).

This might be true, but you haven't demonstrated it.

The "can be ... practically impossible" was hogwash. I posted late, sorry. The "quite inefficient" holds.


With one little helper function, you should be able to convert any comparison function into a key function, with relatively little overhead.

I wouldn't call an extra Python method call per comparision "relatively little overhead".

Note that the same number of comparisions as with a 'cmp' based sort is 
performed.

But with the wrapper every comparision is indirected through a Python method call (I think, but not sure, that the creation of those comparision objects does not introduce significant overhead, but the calls must).


<code>
# -*- coding: utf-8 -*-
from __future__ import print_function
from __future__ import unicode_literals
try:
    range = xrange
except:
    pass


import locale
import sys
import random
import timeit

def CmpToKey( mycmp ):
    "Convert a cmp= function into a key= function"
    class K( object ):
        def __init__( self, obj, *args ):
            self.obj = obj
        def __lt__( self, other ):
            return mycmp( self.obj, other.obj ) == -1
        def __gt__( self, other ):
            return mycmp( self.obj, other.obj ) == 1
        def __eq__( self, other ):
            return mycmp( self.obj, other.obj ) == 0
        def __le__( self, other ):
            return mycmp( self.obj, other.obj ) != 1
        def __ge__( self, other ):
            return mycmp( self.obj, other.obj ) != -1
        def __ne__( self, other ):
            return mycmp( self.obj, other.obj ) != 0
    return K

def elapsed_time_for( f ):
    n_calls = 1
    return timeit.timeit( f, number = n_calls )

def major_pyversion():
    return sys.version_info[0]

def sorting_of( string_list ):
    def sort2x():
        string_list.sort( cmp = locale.strcoll )
    def sort3x():
        string_list.sort( key = CmpToKey( locale.strcoll ) )
    return sort2x if major_pyversion() <= 2 else sort3x


alphabet        = "abcdefghijklmnopqrstuvwxyzæøå"
n_strings       = int( 1e5 )
nbno_locale     = ""        # "nb_NO" raises exception on my machine.

def rnd_string():
    len = random.randint( 1, 80 )
    s = ""
    for i in range( len ):
        s = s + random.choice( alphabet )
    return s

locale.setlocale( locale.LC_ALL, nbno_locale )
random.seed( 666 )
assert( rnd_string() == "æmoxqudxlszåaåcteærmuocøjrdæpbvyabwhn" )

print( "Creating array of pseudo-random strings..." )
L = [rnd_string() for i in range( n_strings )]
print( "Sorting..." )
time = elapsed_time_for( sorting_of( L ) )
print( "Last string is '{0}'".format( L[-1] ) )
print( "{0}".format( time ) )
assert( L[-1][0] == "å" )
</code>


<example>
C:\Documents and Settings\Alf\test> py3 sort.py
Creating array of pseudo-random strings...
Sorting...
Last string is 'åååywxuybrxgvnstoyvcmdåjdlnxwwfsbnzsinxncmtrxuxgtqtduzvåixqovtjxiflfesuvfa'
8.45987772189

C:\Documents and Settings\Alf\test> py2 sort.py
Creating array of pseudo-random strings...
Sorting...
Last string is 'åååywxuybrxgvnstoyvcmdåjdlnxwwfsbnzsinxncmtrxuxgtqtduzvåixqovtjxiflfesuvfa'
3.41336790011

C:\Documents and Settings\Alf\test> _
</example>


In the run above Python 3.x was only roughly 2.5 times slower. On my machine it ranges from 2.5 times slower to 3 times slower. Most often around 2.5 x slower.

Of course this does not measure how much faster a Python 3.x cmp/"<" based sort /could be/, only how slow, sluggish and snail-like it is compared to 2.x. :-)

In other words, that's the speed price, 2.5 to 3 times slower sorting, for ditching cmp instead of e.g. replacing it with something better.


Cheers & hth.,

- Alf
--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to