Re: [julia-users] Not fun

Milan Bouchet-Valat Fri, 17 Apr 2015 14:55:44 -0700

On Friday, 17 April 2015 at 10:57 -0700, Scott Jones wrote:
> 
> 
> On Friday, April 17, 2015 at 12:41:06 PM UTC-4, Steven G. Johnson
> wrote:
>         
>         
>         On Friday, April 17, 2015 at 11:50:44 AM UTC-4, Scott Jones
>         wrote:
>                 Ugh... for some of what I'm doing, it is nice to know
>                 that a string contains only ASCII characters,  I
>                 really hope that you don't go ahead with removing
>                 ASCIIString.
>         
>         
>         You can always call isascii(...) on a UTF-8 string  ... can
>         you explain why you care in your application?
>         
> 
> 
> I assume that is then an O(n) function, is it not?
> 
> 
> I'd rather have something that is O(1)!
Of course it is, if you need to check all the characters one by one. But
if you already know the string is ASCII, you don't need to do that check
-- and apparently you do know, since you're willing to use ASCIIString...


Also, you can usually write your code so that it handles Unicode as well
as pure ASCII quite efficiently. For example, if you iterate over a
string and check for a given character or substring, you don't have to
wonder whether non-ASCII characters might be present: everything just
works.
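
To illustrate the point about searching, here is a minimal sketch in Python (chosen only for illustration, not Julia). It relies on a property of UTF-8: every byte of a multi-byte sequence has its high bit set, so an ASCII byte can never match in the middle of a non-ASCII character. The function name split_fields and the sample strings are made up for this example.

```python
from typing import List


def split_fields(data: bytes, delimiter: bytes = b",") -> List[bytes]:
    """Split UTF-8 encoded data on an ASCII delimiter without decoding it.

    Bytes belonging to multi-byte UTF-8 sequences are all in 0x80-0xFF,
    so a delimiter below 0x80 cannot be found inside such a sequence.
    """
    return data.split(delimiter)


# The same code path handles pure ASCII and mixed content.
ascii_only = "alpha,beta,gamma".encode("utf-8")
with_unicode = "naïve,café,日本語".encode("utf-8")

ascii_fields = [f.decode("utf-8") for f in split_fields(ascii_only)]
unicode_fields = [f.decode("utf-8") for f in split_fields(with_unicode)]

print(ascii_fields)
print(unicode_fields)
```

The byte-level split never needs to know whether the input is pure ASCII.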

>         
>         
>                 (I can treat it as ANSI Latin 1, without any
>                 modification,
>         
>         
>         You can treat it as UTF-8, without any modification...
> 
> 
> I don't want anything that requires O(n) operations for a lot of the
> string handling I need to do... 
What kind of string handling? The idea that UTF-8 requires O(n)
operations is a myth: that only happens if you need to access code point
number n counting from the beginning of the string. If you just need to
go over a string to do some processing, then it doesn't change much
whether the string is ASCII stored in a UTF8String or ASCII stored in an
ASCIIString.
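
A rough sketch of the distinction, again in Python and assuming valid UTF-8 input (the helper names are hypothetical): walking over the whole string is a single pass over the bytes, and only "give me code point number n" has to count from the start.

```python
from typing import Iterator


def code_point_starts(data: bytes) -> Iterator[int]:
    """Yield the byte offset at which each code point starts.

    Assumes valid UTF-8. Continuation bytes look like 0b10xxxxxx;
    any other byte begins a new code point.
    """
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:
            yield offset


def count_code_points(data: bytes) -> int:
    """Sequential processing: one pass, the same cost as for ASCII."""
    return sum(1 for _ in code_point_starts(data))


def offset_of_nth(data: bytes, n: int) -> int:
    """Random access by code point index: the case that needs a scan."""
    for index, offset in enumerate(code_point_starts(data)):
        if index == n:
            return offset
    raise IndexError("code point index out of range")


sample = "aé日z".encode("utf-8")  # 1 + 2 + 3 + 1 = 7 bytes
print(count_code_points(sample))
print(offset_of_nth(sample, 2))
```

For pure ASCII data every byte starts a code point, so the loop degenerates to the same byte-by-byte walk an ASCII-only type would do.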

>                 or expand it to UTF-16 or UTF-32 by just widening
>                 bytes to 16-bit or 32-bit words, which I can do *very*
>                 fast in x86-64 assembly, esp. with some of the newer
>                 instructions!)
>         
>         
>         Why would you want to do this?  Except for interop with
>         foreign libraries?  UTF-16 is the worst of all worlds as an
>         encoding.
>         
>         
> 
> 
> UTF-16 that I know has no surrogate pairs... really just UCS2...
> Depends on what you are doing with UTF-16...  and yes, interop with
> Java, JDBC, lots of databases... UTF-16 is very useful...
> UTF-8 can blow things up to 3 bytes per character, potentially taking
> 1.5 times the space of UTF-16... not good!
I think you should really benchmark this kind of thing before choosing a
legacy encoding such as Latin-1. Even for Asian scripts, typical content
contains a lot of ASCII markup which makes UTF-8 actually more
efficient:
http://utf8everywhere.org/#faq.asians
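
As a starting point for that kind of measurement, here is a small Python sketch comparing encoded sizes. The two sample strings are invented for illustration and are not a benchmark; real content should be measured the same way.

```python
def encoded_sizes(text: str) -> dict:
    """Return the size in bytes of text in UTF-8 and in UTF-16.

    "utf-16-le" is used so that no byte order mark is counted.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


# Invented samples: Japanese text alone, and the same kind of text
# wrapped in ASCII markup as it typically appears in a web page.
plain = "日本語のテキスト"
markup = '<p class="title"><a href="/wiki/index.html">日本語</a></p>'

plain_sizes = encoded_sizes(plain)
markup_sizes = encoded_sizes(markup)

print("plain: ", plain_sizes)
print("markup:", markup_sizes)
```

The plain sample shows the 1.5 times worst case mentioned above. For the markup sample, each ASCII character costs one byte in UTF-8 and two in UTF-16, which outweighs the extra byte per CJK character.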

Also, if you can afford to check that your UTF-16 has no surrogate
pairs, you can also afford to check whether what is stored in a
UTF8String is plain ASCII or not.
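
Both checks are a single linear scan, as this Python sketch suggests (the function names are hypothetical, and the UTF-16 input is assumed to be little-endian without a byte order mark):

```python
import struct


def is_ascii(utf8_data: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the data is plain ASCII."""
    return all(byte < 0x80 for byte in utf8_data)


def has_surrogates(utf16le_data: bytes) -> bool:
    """True if any 16-bit code unit lies in the surrogate range.

    Assumes little-endian UTF-16 with an even number of bytes.
    """
    count = len(utf16le_data) // 2
    units = struct.unpack("<%dH" % count, utf16le_data)
    return any(0xD800 <= unit <= 0xDFFF for unit in units)


print(is_ascii("hello".encode("utf-8")))
print(is_ascii("héllo".encode("utf-8")))
print(has_surrogates("abc".encode("utf-16-le")))
print(has_surrogates("a\U0001F600".encode("utf-16-le")))
```

Either result can be computed once and remembered alongside the data, so neither encoding has an advantage here.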


>                 I do also think it would be nice to have 8-bit (ANSI
>                 Latin 1 or binary, not UTF-8), 16-bit (UCS2)
>         
>         
>         Again, for interop with legacy files?  I can see no other
>         reason to use Latin-1 (or Windows 1252) or UCS2 these days.
> 
> 
> Latin-1 is a strict subset of Unicode (I'd never use anything other
> than Unicode), but both ASCII and ANSI Latin-1 are just subsets,
> and as such are rather useful to save space when most of the time
> you are just dealing with text from Western Europe, the Americas,
> Australia & New Zealand...
You don't save any space by storing ASCII text in Latin-1 instead of
UTF-8...
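
A quick way to see this, sketched in Python with made-up sample strings: ASCII text produces exactly the same bytes in both encodings, and only characters in the upper half of Latin-1 cost one extra byte in UTF-8.

```python
ascii_text = "plain ASCII text"
accented_text = "café"

# ASCII text: identical bytes in Latin-1 and UTF-8.
ascii_latin1 = ascii_text.encode("latin-1")
ascii_utf8 = ascii_text.encode("utf-8")

# Text with one accented letter: 'é' is 1 byte in Latin-1, 2 in UTF-8.
accented_latin1 = accented_text.encode("latin-1")
accented_utf8 = accented_text.encode("utf-8")

print(ascii_latin1 == ascii_utf8)
print(len(accented_latin1), len(accented_utf8))
```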

> It's all about performance, both when doing string handling, and when
> saving/reading something to a database (or sending it over a wire)...
I think you really should look at concrete examples and do some
benchmarking. I doubt it will make a difference in most typical uses.


That said, I'm not opposed to keeping ASCIIString somewhere (in a
package?), as long as it's clear it's only intended for very specific
cases.


Regards
