Re: Chardet, file, ... and the Flexible String Representation

Ned Batchelder Mon, 09 Sep 2013 10:18:37 -0700

On 9/9/13 10:28 AM, wxjmfa...@gmail.com wrote:

On Friday, 6 September 2013 17:46:14 UTC+2, Piet van Oostrum wrote:

wxjmfa...@gmail.com writes:

Conceptually, the Flexible String Representation has to
face the same problem. It splits "unicode" into chunks and
it has to solve two problems at the same time: the coding
and the handling of multiple "char sets". The problem?
It fails.
"This poor Flexible String Representation does not succeed
in solving the problem it creates itself."



The FSR does not split Unicode into chunks. It does not create problems and 
therefore it doesn't have to solve them.



The FSR simply stores a Unicode string as an array[*] of ints (the Unicode code 
points of the characters of the string). That's it. Then it uses a 
memory-efficient way to store this array of ints. But that has nothing to do 
with character sets. The same principle could be used for any array of ints.
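
To make the "memory-efficient array of ints" concrete, here is a minimal sketch that estimates how many bytes are spent per code point. It assumes CPython 3.3 or later (where PEP 393 applies); bytes_per_char is a hypothetical helper name, not part of any library.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per code point for strings made only of `ch`.

    Comparing two strings of the same kind cancels the fixed object
    header, so only the per-character cost remains.
    """
    longer = sys.getsizeof(ch * (n + 1))
    shorter = sys.getsizeof(ch)
    return (longer - shorter) // n

for ch in ("a", "é", "€", "\U0001d11e"):
    print(repr(ch), bytes_per_char(ch))
```

The width depends only on the largest code point in the string, which is why nothing here involves character sets.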



So you are seeking problems where there are none. And you would have a lot more 
peace of mind if you stopped doing this.



[*] array in the C sense.

--

Piet van Oostrum <p...@vanoostrum.org>

WWW: http://pietvanoostrum.com/

PGP key: [8DAE142BE17999C4]

----------


Due to its nature, a character can't be handled in the
same way as another type. That's the purpose of the UTF.

-----

Chunk latin-1, performance

ref:

timeit.timeit("a = 'hundred'; 'x' in a")

0.13144639994075646

timeit.timeit("a = 'hundrez'; 'x' in a")

0.13780295544393084

Chunk ucs2, performance

timeit.timeit("a = 'hundre€'; 'x' in a")

0.23505392241617074

Chunk ucs4, performance

timeit.timeit("a = 'hundre\U0001d11e'; 'x' in a")

0.26266673650735584

Comment: Such differences never happen with utf.
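
Single timeit runs like the ones above are noisy, so anyone wanting to reproduce them may prefer to repeat each measurement and keep the minimum. The following is only a sketch: best_time is a hypothetical helper, the repeat and number values are arbitrary, and the results will depend on the machine and the Python build.

```python
import timeit

# The same statements as in the message above, keyed by the label used there.
statements = {
    "latin-1": "a = 'hundred'; 'x' in a",
    "ucs2": "a = 'hundre€'; 'x' in a",
    "ucs4": "a = 'hundre\\U0001d11e'; 'x' in a",
}

def best_time(stmt, repeat=5, number=200_000):
    """Run the statement `number` times, `repeat` times over; keep the fastest run."""
    return min(timeit.repeat(stmt, repeat=repeat, number=number))

results = {label: best_time(stmt) for label, stmt in statements.items()}
for label, seconds in results.items():
    print(f"{label:8s} {seconds:.4f} s")
```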

-----

Chunk latin-1, memory

sys.getsizeof('a')

26

Chunk ucs2, memory

sys.getsizeof('€')

40

Comment: 14 bytes more than latin-1

Chunk ucs4, memory

sys.getsizeof('\U0001d11e')

44

Comment: 18 bytes more than latin-1

Comment: With utf, a char (in a string or not) never exceeds 4 bytes.
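
Note that the sys.getsizeof figures above measure whole Python objects, header included, so they are not directly comparable with a per-character byte count. As a rough illustration (assuming CPython, and using UTF-8 as the example encoding), the sketch below prints both the encoded length and the size of the resulting bytes object for the same characters.

```python
import sys

chars = ("a", "é", "€", "\U0001d11e")

# Number of UTF-8 code units (bytes) needed for each character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in chars]

for ch, n in zip(chars, utf8_lengths):
    encoded = ch.encode("utf-8")
    # len() counts payload bytes only; getsizeof() also counts the
    # fixed header of the bytes object that holds them.
    print(repr(ch), n, sys.getsizeof(encoded))
```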

-----

'a' + '€' in utf, conceptually

Concatenate the *unicode transformation units*.
Some kind of a real direct 'a' + '€'.


'a' + '€' in FSR, conceptually

1) Check the "internal coding" of 'a'
2) Check the "internal coding" of '€'
3) Compare these codings

4a) If they match, concatenate the bytes

4b) If they do not match
        5) Reencode the string which has to be reencoded
        6) Concatenate
        7) Set the "internal coding" status for
        further processing
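
For comparison, here is a minimal model of how the storage width of a concatenation result can be determined under PEP 393, as far as I understand it: the result takes the width required by the largest code point in either operand. The helpers kind and concat_kind are hypothetical names for illustration, not CPython APIs, and this models only the choice of width, not the actual copying.

```python
def kind(s):
    """Bytes per code point needed to store s: 1, 2 or 4."""
    largest = max(map(ord, s), default=0)
    if largest < 0x100:
        return 1
    if largest < 0x10000:
        return 2
    return 4

def concat_kind(a, b):
    """Width of a + b: the wider of the two operands."""
    return max(kind(a), kind(b))

print(kind("a"), kind("€"), concat_kind("a", "€"))
```

The width predicted from the operands always matches the width of the actual result, since the largest code point of a + b is the larger of the two maxima.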

-----

Complicated and full of side effects, e.g.:

sys.getsizeof('a')

sys.getsizeof('aé')

39

Isn't a latin-1 "é" supposed to count the same as a latin-1 "a"?
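
One way to check this is to separate the per-character cost from the fixed object overhead. The sketch below assumes CPython 3.3 or later, where (per PEP 393) pure-ASCII strings use a smaller object header than other strings; the variable names are only illustrative.

```python
import sys

# Cost of one extra character, for ASCII and for non-ASCII latin-1.
ascii_step = sys.getsizeof("a" * 11) - sys.getsizeof("a" * 10)
latin1_step = sys.getsizeof("é" * 11) - sys.getsizeof("é" * 10)

# Size gap between the two kinds of string, at two different lengths.
gap_short = sys.getsizeof("é" * 10) - sys.getsizeof("a" * 10)
gap_long = sys.getsizeof("é" * 1000) - sys.getsizeof("a" * 1000)

print(ascii_step, latin1_step)  # per-character cost
print(gap_short, gap_long)      # header difference, independent of length
```

If both steps are equal and the gap does not grow with length, an "é" is stored in the same number of bytes as an "a", and the extra size comes from the header alone.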

----

I picked random methods; there may be variations, but basically
this general behaviour is always to be expected.


jmf

jmf, thanks for your reply. You've calmed my fears that there is something wrong with the Flexible String Representation. None of the examples you show demonstrate any behavior contrary to the Unicode spec.


--Ned.
--
https://mail.python.org/mailman/listinfo/python-list
