On Thursday, April 30, 2015 at 6:34:23 PM UTC-4, Páll Haraldsson wrote:
>
> Interesting.. does that mean Unicode then that is esp. faster or something 
> else?
>
> >800x faster is way worse than I thought and no good reason for it..
>

That particular case is because CPython (the standard C implementation of 
Python, which is what most people mean when they say "Python") has 
optimized the case of

var += string

which appends to a variable.

Although strings *are* immutable in Python, as in Julia, CPython detects 
that you are replacing a string with that string concatenated with another. 
If nobody else holds a reference to the string in that variable, it can 
simply extend the string in place; otherwise, it makes a new string big 
enough for the result and sets the variable to that new string.
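
To make the pattern concrete, here is a minimal sketch of the two usual ways to build a string in Python. The function names are made up for illustration. Whether the += loop actually gets resized in place is a CPython implementation detail, so the code only shows that both approaches give the same result, not how fast either one is.

```python
def build_with_concat(parts):
    """Build a string by repeated +=, the pattern discussed above."""
    result = ""
    for piece in parts:
        # 'result' is the only reference to the string here, which is the
        # situation where CPython may be able to extend it in place.
        result += piece
    return result


def build_with_join(parts):
    """Build the same string with str.join, which does not rely on that."""
    return "".join(parts)


pieces = ["ab"] * 1000
assert build_with_concat(pieces) == build_with_join(pieces)
```

If portability across Python implementations matters, the join version is the safer habit, since it does not depend on the reference-count trick.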
 

> I'm really intrigued what is this slow, can't be the simple things like 
> say just string concatenation?!
>
> You can get similar speed using PyCall.jl :)
>

I'm not so sure... I don't really think so, because you still have to move 
the string from Julia (which uses either ASCII or UTF-8 for strings by 
default; you have to convert them explicitly to get UTF-16 or UTF-32) to 
Python, and then back, and Julia's string conversions are rather slow: 
O(n^2) in most cases.
(I'm working on improving that, and I hope I can get my changes accepted 
into Julia's Base.)

> For some obscure function like Levenshtein distance I might expect this (or 
> not implemented yet in Julia) as Python would use tuned C code or in any 
> function where you need to do non-trivial work per function-call.
>
>
> I failed to add regex to the list as an example as I was pretty sure it 
> was as fast (or faster, because of macros) as Perl as it is using the same 
> library.
>
> Similarly for all Unicode/UTF-8 stuff I was not expecting slowness. I know 
> the work on that in Python2/3 and expected Julia could/did similar.
>

No, a lot of the algorithms are O(n) instead of O(1), because of the 
decision to use UTF-8: characters vary in width, so finding the n-th 
character means scanning from the start of the string.
I'd like to convince the core team to change Julia to do what Python 3 does.
UTF-8 is pretty bad to use for internal string representation (where it 
shines is as an interchange format).
UTF-8 can take up to 50% more storage than UTF-16 if you are just dealing 
with BMP characters (3 bytes instead of 2 for anything from U+0800 up).
If you have some field that needs to hold a certain number of Unicode 
characters, for the full range of Unicode, you need to allocate 4 bytes for 
every character, so there are no savings compared to UTF-16 or UTF-32.
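
As a rough illustration of both points (storage size, and the linear scan needed to index into UTF-8), here is a small Python sketch. The helper names are hypothetical, and the scan is a simplified stand-in for the idea, not Julia's actual implementation.

```python
def encoded_sizes(text):
    """Byte counts for the same text in three encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def nth_char_offset(utf8_bytes, n):
    """Byte offset of the n-th character (0-based), found by linear scan."""
    count = -1
    for offset, byte in enumerate(utf8_bytes):
        # Bytes of the form 10xxxxxx are continuation bytes; anything else
        # starts a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                return offset
    raise IndexError(n)


# Three BMP characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
sizes = encoded_sizes("\u65e5\u672c\u8a9e")
assert sizes == {"utf-8": 9, "utf-16": 6, "utf-32": 12}
```

With a fixed-width representation the offset of character n is just n times the width, which is where the O(1) versus O(n) difference comes from.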

Python 3 internally stores strings as either 7-bit (ASCII), 8-bit (ANSI 
Latin 1, only characters < 0x100 present), 16-bit (UCS-2, i.e. there are no 
non-BMP characters present), or 32-bit (UTF-32). You might wonder why there 
is a special distinction between 7-bit ASCII and 8-bit ANSI Latin 1... they 
are both Unicode subsets, but 7-bit ASCII can also be used directly without 
conversion as UTF-8.
All internal formats are directly addressable (unlike Julia's UTF8String 
and UTF16String), and conversions between the 4 internal types are very 
fast: simple widening (or a no-op, as in the case of ASCII -> ANSI) when 
going from smaller to larger.
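
One way to see those widths from the outside is to compare sys.getsizeof for two strings that differ by one character. This is a sketch that assumes CPython 3.3 or later (where this flexible representation was introduced); other Python implementations may report different numbers. The helper name is made up for illustration.

```python
import sys


def bytes_per_char(ch):
    """Estimate the storage width CPython picks for a string made of ch."""
    shorter = ch * 1000
    longer = ch * 1001
    # The fixed object header cancels out, leaving the per-character width.
    return sys.getsizeof(longer) - sys.getsizeof(shorter)


for label, ch in [
    ("ASCII", "a"),
    ("Latin 1", "\u00e9"),
    ("BMP", "\u20ac"),
    ("non-BMP", "\U0001F600"),
]:
    print(label, bytes_per_char(ch))
```

Note that the width is chosen per string, based on the widest character it contains, so a single non-BMP character widens the whole string.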

Julia also has a big problem with always wanting to have a terminating \0 
byte or word, which means that you can't take a substring or slice of 
another string without making a copy to be able to add that terminating \0 
(so lots of extra memory allocation and garbage collection for common 
algorithms).
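
The difference between a slice that shares storage and one that must be copied can be sketched in Python with memoryview. This is only an analogy for the trade-off described above, not how Julia implements strings, and the variable names are made up.

```python
data = b"hello world"

# A view needs only an offset and a length, so it shares the original buffer.
view = memoryview(data)[6:11]
assert view.obj is data          # still backed by the same bytes object
assert bytes(view) == b"world"

# A \0-terminated substring cannot share the buffer, because the byte after
# "world" in the original is not guaranteed to be \0. It has to be copied.
terminated_copy = data[6:11] + b"\0"
assert terminated_copy == b"world\0"
```

Algorithms that split or scan text produce many such substrings, so every forced copy becomes an allocation for the garbage collector to clean up later.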

I hope that makes things a bit clearer!

Scott
