Re: [julia-users] Not fun

Scott Jones Fri, 17 Apr 2015 16:49:46 -0700

On Friday, April 17, 2015 at 5:54:40 PM UTC-4, Milan Bouchet-Valat wrote:
>
> On Friday, 17 April 2015 at 10:57 -0700, Scott Jones wrote: 
> > 
> > 
> > On Friday, April 17, 2015 at 12:41:06 PM UTC-4, Steven G. Johnson 
> > wrote: 
> >         
> >         
> >         On Friday, April 17, 2015 at 11:50:44 AM UTC-4, Scott Jones 
> >         wrote: 
> >                 Ugh... for some of what I'm doing, it is nice to know 
> >                 that a string contains only ASCII characters,  I 
> >                 really hope that you don't go ahead with removing 
> >                 ASCIIString. 
> >         
> >         
> >         You can always call isascii(...) on a UTF-8 string  ... can 
> >         you explain why you care in your application? 
> >         
> > 
> > 
> > I assume that is then an O(n) function, is it not? 
> > 
> > 
> > I'd rather have something that is O(1)! 
> Of course, if you need to check all characters one by one. But if you 
> know it's ASCII, you don't need to do that check -- and apparently you 
> do since you're willing to use ASCIIString... 
>


That is why I would be sad if Julia, with its immutable strings, didn't 
distinguish between ASCIIString and UTF8String...
it would force me to do O(n) checks to see whether a string was really just 
ASCII characters (or ANSI Latin 1) and could be stored in just 1 byte per 
character.
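
To make the O(1) versus O(n) point concrete, here is a minimal sketch in Python (not Julia, and not how Julia implements its string types). The names `Utf8Text`, `AsciiText`, `make_text` and `is_known_ascii` are hypothetical. The idea is only that the O(n) scan happens once, when the immutable string is built, and afterwards the type itself answers the question in constant time.

```python
class Utf8Text:
    """Immutable wrapper around bytes that are valid UTF-8 (hypothetical type)."""

    def __init__(self, data: bytes):
        data.decode("utf-8")  # O(n) validation; raises UnicodeDecodeError if invalid
        self._data = bytes(data)

    @property
    def data(self) -> bytes:
        return self._data


class AsciiText(Utf8Text):
    """Subtype whose existence records that every byte is < 0x80.

    Build instances through make_text() so the invariant actually holds.
    """


def make_text(data: bytes) -> Utf8Text:
    # The single O(n) ASCII scan happens here, once, at construction time.
    return AsciiText(data) if data.isascii() else Utf8Text(data)


def is_known_ascii(text: Utf8Text) -> bool:
    # O(1): a type check, with no scan of the contents.
    return isinstance(text, AsciiText)
```

With only one string type, `is_known_ascii` would have to fall back to scanning the bytes on every call, which is the O(n) cost being discussed.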
 

> Also, you can usually write your code so that it's able to handle 
> Unicode as well as pure ASCII quite efficiently. For example, if you 
> iterate over a string and check for a given character or substring, you 
> don't have to wonder whether non-ASCII characters might be present, 
> everything works magically. 
>

Much quicker to check for a particular character, than for a sequence of 
characters, at least in my experience...
I spent quite a lot of time optimizing string handling over the last 29 
years for a language/database...
 

> >         
> >         
> >                 (I can treat it as ANSI Latin 1, without any 
> >                 modification, 
> >         
> >         
> >         You can treat it as UTF-8, without any modification... 
> > 
> > 
> > I don't want anything that requires O(n) operations for a lot of the 
> > string handling I need to do... 
> What kind of string handling? UTF-8 requiring O(n) is a myth, it only 
> happens if you need to access the code point number n from the beginning 
> of the string. If you just need to go over a string to do some 
> processing, then it doesn't change much whether the string is ASCII 
> stored in a UTF8String or ASCII stored in ASCIIString. 
>

A myth?  Sorry, but accessing code point number n is *precisely* what a lot 
of code
needs to do... I've dealt with the issues back from the days of dealing 
with SJIS, EUC, GB and other
multibyte character sets...
Having many of the most common operations in a language used mostly for 
string/database processing
go from O(1) to O(n) is NOT good!
That's why back in '95, when asked to add support for Japanese, I insisted 
on using Unicode (1.0) instead of a
multibyte character set... which was a rather prescient decision...
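
A small Python sketch of why indexing by code point differs between the two kinds of encoding. The function names `utf8_offset` and `ucs2_offset` are made up for illustration, and the UTF-8 input is assumed to be valid. Finding code point n in UTF-8 means walking the bytes from the start, while in a fixed-width 16-bit encoding without surrogate pairs the offset is plain arithmetic.

```python
def utf8_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th (0-based) code point in valid UTF-8. O(n) scan."""
    if n < 0:
        raise IndexError("negative code point index")
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


def ucs2_offset(n: int) -> int:
    """Byte offset of the n-th (0-based) 16-bit unit in a fixed-width encoding. O(1)."""
    if n < 0:
        raise IndexError("negative code unit index")
    return 2 * n
```

The sketch says nothing about how often real code needs this operation; it only shows the difference in cost when it does.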

> >                 or expand it to UTF-16 or UTF-32 by just widening 
> >                 bytes to 16-bit or 32-bit words, which I can do *very* 
> >                 fast in x86-64 assembly, esp. with some of the newer 
> >                 instructions!) 
> >         
> >         
> >         Why would you want to do this?  Except for interop with 
> >         foreign libraries?  UTF-16 is the worst of all worlds as an 
> >         encoding. 
> >         
> >         
> > 
> > 
> > UTF-16 that I know has no surrogate pairs... really just UCS2... 
> > Depends on what you are doing with UTF-16...  and yes, interop with 
> > Java, JDBC, lots of databases... UTF-16 is very useful... 
> > UTF-8 can blow things up to 3 bytes per character, potentially taking 
> > 1.5 times the space of UTF-16... not good! 
> I think you should really benchmark this kind of thing before choosing a 
> legacy encoding such as Latin-1. Even for Asian scripts, typical content 
> contains a lot of ASCII markup which makes UTF-8 actually more 
> efficient: 
> http://utf8everywhere.org/#faq.asians 
>  
>
>
You don't understand - I only use ANSI Latin 1 because it is a strict 
subset of Unicode... it's mainly a very fast way of saving 50% of your disk
space, without the complications of elaborate conversions.
I designed a system that used ANSI Latin 1 to store data if all of the 
characters were < 256, and UTF-16 otherwise... (along with a packed Unicode 
format that takes much less space than UTF-8, and less than even S-JIS for 
Japanese data sets...)
I spent about 19-20 years benchmarking exactly this...
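
A minimal Python sketch of the "Latin 1 if every character is < 256, UTF-16 otherwise" rule. The one-byte tag, the constants `LATIN1_TAG` and `UTF16_TAG`, and the names `pack` and `unpack` are assumptions made for this example and do not describe the actual system. The packed Unicode format mentioned above is not attempted here.

```python
# Hypothetical tag values marking which encoding follows.
LATIN1_TAG = b"\x01"
UTF16_TAG = b"\x02"


def pack(text: str) -> bytes:
    """Store text in 1 byte per character when possible, else as UTF-16."""
    if all(ord(ch) < 256 for ch in text):
        # Latin-1 maps U+0000..U+00FF to the identical byte values,
        # so no lookup table or conversion logic is needed.
        return LATIN1_TAG + text.encode("latin-1")
    return UTF16_TAG + text.encode("utf-16-le")


def unpack(blob: bytes) -> str:
    """Inverse of pack(); the tag byte says which decoder to use."""
    tag, payload = blob[:1], blob[1:]
    if tag == LATIN1_TAG:
        return payload.decode("latin-1")
    if tag == UTF16_TAG:
        return payload.decode("utf-16-le")
    raise ValueError("unknown encoding tag")
```

For text limited to characters below 256, the Latin 1 payload is half the size of the UTF-16 payload, which is where the 50% figure comes from.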

> Also, if you can afford checking that your UTF-16 has no surrogate 
> pairs, you can also afford checking whether it's plain ASCII stored in a 
> UTF8String or not. 
>

The point is, that if the language already *knows* that it is just plain 
ASCII, I can take advantage of that, no O(n) checking required.
A large number of the sources will be from databases where ANSI Latin1 is 
the default character set, and UTF8 or UTF16 is only used
if it is known that Unicode is needed...

If the source gives me UTF-16 or UTF-32, then I will figure out how it can 
be most efficiently stored. Never UTF-32, of course: UTF-16 takes less 
space, usually quite a bit less, unless you have a record full of 
emojis! ;-) Even then, it would just take the same amount of space as the 
UTF-32 representation, or UTF-8, as long as the UTF-8 encoder works 
correctly and represents them in 4 bytes instead of encoding the surrogate 
pair as two 3-byte sequences... which at least used to be a common problem.
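
The emoji case can be checked with a short Python sketch. The helper names `encoded_sizes` and `surrogates_as_utf8` are invented for this example. The second function deliberately reproduces the faulty behaviour, UTF-8-encoding each UTF-16 surrogate on its own, which is the byte layout that CESU-8 describes.

```python
def encoded_sizes(text: str) -> dict:
    """Size in bytes of text in each encoding (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def surrogates_as_utf8(text: str) -> bytes:
    """Faulty encoder: treat every UTF-16 code unit as its own code point."""
    units = text.encode("utf-16-le")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "little")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate,
        # which a strict UTF-8 encoder would refuse to produce.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)
```

For a single character outside the Basic Multilingual Plane such as U+1F600, all three correct encodings take 4 bytes, while the faulty path produces 6.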

UTF-8 can easily blow up Greek, Hebrew, Russian, Arabic (and many other 
languages) text to take twice as much space, and Asian text to 50% more,
it all really depends on what's in the records... (I also worked for a 
number of years dealing with support for unstructured data... so I 
understand
pretty well the frequencies of 1, 2, 3, or 4 byte UTF-8 encodings of 
Unicode characters in things like books, magazine articles, doctor's 
notes...)
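
The per-script sizes are easy to measure. Below is a minimal Python sketch; the function name `size_table` is made up, and it assumes the "twice as much space" comparison is against a single-byte legacy code page (for example ISO 8859-5 for Russian), with the "50% more" figure for Asian text being relative to UTF-16.

```python
def size_table(text: str, legacy_codec: str) -> dict:
    """Character count and encoded size in bytes for one text sample.

    legacy_codec is the name of a Python codec for a pre-Unicode
    encoding of the same script, e.g. "iso8859_5" or "shift_jis".
    """
    return {
        "chars": len(text),
        "legacy": len(text.encode(legacy_codec)),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }
```

Calling `size_table` on pure Cyrillic text with `"iso8859_5"` gives a UTF-8 size double the legacy size and equal to the UTF-16 size, while pure kanji text with `"shift_jis"` gives a UTF-8 size 1.5 times the UTF-16 size. Real records mix scripts with ASCII, so the actual ratio depends on the data.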
 

> >                 I do also think it would be nice to have 8-bit (ANSI 
> >                 Latin 1 or binary, not UTF-8), 16-bit (UCS2) 
> >         
> >         
> >         Again, for interop with legacy files?  I can see no other 
> >         reason to use Latin-1 (or Windows 1252) or UCS2 these days. 
> > 
> > 
> > Latin-1 is a strict subset of Unicode (I'd never use anything other 
> > than Unicode) - but both ASCII and ANSI Latin-1 are just subsets, 
> > and as such are rather useful to save space when most of the time 
> > you are just dealing with text from Western Europe, the Americas, 
> > Australia & New Zealand... 
> You don't save any space by storing ASCII text in Latin-1 instead of 
> UTF-8... 
>
> > It's all about performance, both when doing string handling, and when 
> > saving/reading something to a database (or sending it over a wire)... 
> I think you really should look at concrete examples and do some 
> benchmarking. I doubt it will make a difference in most typical uses. 
>

Again, I did spend quite a lot of time looking at benchmark results of 
exactly that, over a period of almost 20 years...

> That said, I'm not opposed to keeping ASCIIString somewhere (in a 
> package?), as long as it's clear it's only intended for very specific 
> cases. 
>

That really makes it not that useful for me... I'll just have to do the 
O(n) checking/conversions of the UTF-8 strings from Julia, and to keep up 
performance, I'll just have separate methods that take Vectors of UInt8, 
UInt16, and UInt32 to handle whatever I get from the database.
 

> Regards 
>

Regards as well
