On Thursday, April 30, 2015 at 6:34:23 PM UTC-4, Páll Haraldsson wrote:
>
> Interesting.. does that mean Unicode then that is esp. faster or something else?
>
> >800x faster is way worse than I thought and no good reason for it..
>
That particular case is because CPython (the standard C implementation of Python, which is what most people mean when they say "Python") has optimized the case of var += string, i.e. appending to a variable. Although strings *are* immutable in Python, as in Julia, CPython detects that you are replacing a string with that same string concatenated with another. If nobody else holds a reference to the string in that variable, it can simply extend the string in place; otherwise, it allocates a new string big enough for the result and sets the variable to that new string.

> I'm really intrigued what is this slow, can't be the simple things like
> say just string concatenation?!
>
> You can get similar speed using PyCall.jl :)
>

I'm not so sure... I don't really think so, because you still have to move the string from Julia (which uses either ASCII or UTF-8 for strings by default; you have to convert explicitly to get UTF-16 or UTF-32) to Python, and then back, and Julia's string conversions are rather slow: O(n^2) in most cases. (I'm working on improving that, and I hope I can get my changes accepted into Julia's Base.)

> For some obscure function like Levenshtein distance I might expect this (or
> not implemented yet in Julia) as Python would use tuned C code or in any
> function where you need to do non-trivial work per function-call.
>
> I failed to add regex to the list as an example as I was pretty sure it
> was as fast (or faster, because of macros) as Perl as it is using the same
> library.
>
> Similarly for all Unicode/UTF-8 stuff I was not expecting slowness. I know
> the work on that in Python2/3 and expected Julia could/did similar.
>

No, a lot of the algorithms are O(n) instead of O(1), because of the decision to use UTF-8 (finding the n-th character, for example, means scanning from the start of the string). I'd like to convince the core team to change Julia to do what Python 3 does. UTF-8 is pretty bad to use for internal string representation (where it shines is as an interchange format).
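To make the O(n) point concrete, here is a minimal Python sketch of what "give me character number n" costs when all you have is UTF-8 bytes. This is only an illustration, not Julia's actual code; the function name `utf8_char_at` is made up for this post.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th character (0-based) of UTF-8 encoded bytes.

    There is no way to jump straight to the character: every byte before
    it has to be inspected, so the cost grows with the index.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1    # number of character starts seen so far, minus one
    start = None  # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                # The next character starts here, so the wanted one ends.
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    # The wanted character was the last one in the buffer.
    return data[start:].decode("utf-8")


encoded = "aé€😀z".encode("utf-8")  # characters of 1, 2, 3, 4 and 1 bytes
print(utf8_char_at(encoded, 2))
```

With a fixed-width representation the same lookup is just an offset computation (index times the character width), which is why indexing a Python 3 str is O(1).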
UTF-8 can take up to 50% more storage than UTF-16 if you are just dealing with BMP characters (3 bytes instead of 2 for anything from U+0800 up). If you have some field that needs to hold a certain number of Unicode characters, for the full range of Unicode, you need to allocate 4 bytes for every character, so there are no savings compared to UTF-16 or UTF-32.

Python 3 (3.3 and later) internally stores strings as one of: 7-bit (ASCII), 8-bit (ANSI Latin-1, only characters < 0x100 present), 16-bit (UCS-2, i.e. there are no non-BMP characters present), or 32-bit (UTF-32). You might wonder why there is a special distinction between 7-bit ASCII and 8-bit ANSI Latin-1... they are both Unicode subsets, but 7-bit ASCII can also be used directly, without conversion, as UTF-8.

All internal formats are directly addressable (unlike Julia's UTF8String and UTF16String), and conversions between the 4 internal types are very fast when going from smaller to larger: simple widening (or a no-op, as in the case of ASCII -> ANSI).

Julia also has a big problem with always wanting to have a terminating \0 byte or word, which means that you can't take a substring or slice of another string without making a copy to be able to add that terminating \0 (so lots of extra memory allocation and garbage collection for common algorithms).

I hope that makes things a bit clearer!

Scott
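
P.S. If you want to check the storage numbers yourself, here is a rough Python sketch. The helper names `encoded_sizes` and `storage_width` are made up for this post. `storage_width` relies on `sys.getsizeof`, so it assumes CPython 3.3 or later and is a measurement trick, not an official API for inspecting the internal representation.

```python
import sys


def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in the three Unicode encodings
    (little-endian variants, so no byte-order mark is included)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def storage_width(ch: str) -> int:
    """Estimate how many bytes CPython uses per character for strings
    made only of `ch`, by measuring how much one extra character adds
    to the object size. CPython-specific assumption."""
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)


# BMP characters above U+07FF: 3 bytes each in UTF-8, 2 in UTF-16.
print(encoded_sizes("€" * 10))
# Non-BMP characters: 4 bytes each in all three encodings.
print(encoded_sizes("😀" * 10))
# Per-character width of the internal representation for each kind.
for ch in ("a", "é", "€", "😀"):
    print(repr(ch), storage_width(ch))
```

The first line of output is the 50% case: 30 bytes of UTF-8 against 20 bytes of UTF-16 for the same ten characters.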
