On 30/10/2013 08:13, wxjmfa...@gmail.com wrote:
On Wednesday 30 October 2013 03:17:21 UTC+1, Chris Angelico wrote:
On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence <breamore...@yahoo.co.uk> wrote:

You've stated above that logically unicode is badly handled by the FSR. You then provide a trivial timing example. WTF???

His idea of bad handling is "oh how terrible, ASCII and BMP have optimizations". He hates the idea that it could be better in some areas instead of even timings all along. But the FSR actually has some distinct benefits even in the areas he's citing - watch this:

>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.3625614428649451
>>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
0.6753936603674484
>>> timeit.timeit("a = 'hundred'; 'ģ' in a")
0.25663261671525106
>>> timeit.timeit("a = 'hundreĳ'; 'ģ' in a")
0.3582399439035271

The first two examples are his examples done on my computer, so you can see how all four figures compare. Note how testing for the presence of a non-Latin-1 character in an 8-bit string is very fast. The same goes for testing for a non-BMP character in a 16-bit string. The difference gets even larger if the string is longer:

timeit.timeit("a = 'hundred'*1000; 'x' in a")

10.083378194714726

timeit.timeit("a = 'hundreij'*1000; 'x' in a")

18.656413035735

timeit.timeit("a = 'hundreij'*1000; 'ģ' in a")

18.436268855399135

timeit.timeit("a = 'hundred'*1000; 'ģ' in a")

2.8308718007456264
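
A quick, hedged sketch of what appears to make that last case so fast (the helper names below are invented for illustration and are not CPython internals): if the character you are looking for needs a wider representation than the string itself uses, membership can be answered without scanning the data at all.

def char_width(s):
    # Rough stand-in for the per-character width CPython stores in the
    # str header under PEP 393 (1, 2 or 4 bytes). The real object just
    # reads a header field; recomputing it here is only for illustration.
    m = max(map(ord, s), default=0)
    return 1 if m <= 0xFF else 2 if m <= 0xFFFF else 4

def fast_contains(haystack, needle):
    # If the needle's code point cannot even be represented in the
    # haystack's internal width, it cannot be present: no scan needed.
    if char_width(needle) > char_width(haystack):
        return False
    return needle in haystack

print(fast_contains('hundred' * 1000, 'ģ'))    # False via the early exit
print(fast_contains('hundreĳ' * 1000, 'ģ'))    # both 2-byte: full scan needed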



Wow! The FSR speeds up searches immensely! It's obviously the best thing since sliced bread!

ChrisA

---------


It is not obvious to make comparisons with all these methods and characters (lookup depends on the position in the table, ...). The only thing that can be done and observed is the tendency between the subsets the FSR artificially creates.
One can use the best algorithms to adjust bytes, but it is very hard to escape the fact that if one manipulates two strings with different internal representations, it is necessary to find a way to have a "common internal coding" prior to the manipulation.
It seems to me that this FSR, with its "negative logic", is always attempting to "optimize" for the worst case instead of "optimizing" for the best case.
This effect shows up clearly on the memory side. Compare utf-8, which optimizes memory on a per-code-point basis, with the FSR, which optimizes based on subsets (one of its purposes).

>>> import sys
# FSR
>>> sys.getsizeof(('a'*1000) + 'z')
1026
>>> sys.getsizeof(('a'*1000) + '€')
2040
# utf-8
>>> sys.getsizeof((('a'*1000) + 'z').encode('utf-8'))
1018
>>> sys.getsizeof((('a'*1000) + '€').encode('utf-8'))
1020
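
To isolate the per-code-point behaviour being described, one can also compare the encoded payload lengths directly (a small sketch; the getsizeof figures above include object headers and vary between builds):

import sys

ascii_text = ('a' * 1000) + 'z'
mixed_text = ('a' * 1000) + '€'

# FSR: one character above U+00FF widens every character of the str object,
# so its size roughly doubles.
print(sys.getsizeof(ascii_text), sys.getsizeof(mixed_text))

# utf-8: only the '€' itself costs more (3 bytes instead of 1), so the
# encoded payload grows by just two bytes.
print(len(ascii_text.encode('utf-8')), len(mixed_text.encode('utf-8')))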

jmf


How do these figures compare to the ones quoted here: https://mail.python.org/pipermail/python-dev/2011-September/113714.html ?

--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list
