On Friday, April 17, 2015 at 5:54:40 PM UTC-4, Milan Bouchet-Valat wrote:
> On Friday, April 17, 2015 at 10:57 -0700, Scott Jones wrote:
> > On Friday, April 17, 2015 at 12:41:06 PM UTC-4, Steven G. Johnson wrote:
> > > On Friday, April 17, 2015 at 11:50:44 AM UTC-4, Scott Jones wrote:
> > > > Ugh... for some of what I'm doing, it is nice to know that a
> > > > string contains only ASCII characters, I really hope that you
> > > > don't go ahead with removing ASCIIString.
> > >
> > > You can always call isascii(...) on a UTF-8 string ... can you
> > > explain why you care in your application?
> >
> > I assume that is then an O(n) function, is it not?
> >
> > I'd rather have something that is O(1)!
>
> Of course, if you need to check all characters one by one. But if you
> know it's ASCII, you don't need to do that check -- and apparently you
> do since you're willing to use ASCIIString...
>
That is why I would be sad if Julia didn't distinguish between ASCIIString and UTF8String with its immutable strings... it would force me to do O(n) checks to see if a string was really just ASCII characters (or ANSI Latin 1), and could be stored in just 1 byte per character.

> Also, you can usually write your code so that it's able to handle
> Unicode as well as pure ASCII quite efficiently. For example, if you
> iterate over a string and check for a given character or substring, you
> don't have to wonder whether non-ASCII characters might be present,
> everything works magically.

It is much quicker to check for a particular character than for a sequence of characters, at least in my experience... I spent quite a lot of time optimizing string handling over the last 29 years for a language/database...

> > > > (I can treat it as ANSI Latin 1, without any modification,
> > >
> > > You can treat it as UTF-8, without any modification...
> >
> > I don't want anything that requires O(n) operations for a lot of the
> > string handling I need to do...
>
> What kind of string handling? UTF-8 requiring O(n) is a myth, it only
> happens if you need to access the code point number n from the beginning
> of the string. If you just need to go over a string to do some
> processing, then it doesn't change much whether the string is ASCII
> stored in a UTF8String or ASCII stored in ASCIIString.

A myth? Sorry, but accessing code point number n is *precisely* what a lot of code needs to do... I've dealt with these issues since the days of SJIS, EUC, GB and other multibyte character sets... Having many of the most common operations in a language used mostly for string/database processing go from O(1) to O(n) is NOT good! That's why back in '95, when asked to add support for Japan, I insisted on using Unicode (1.0) instead of a multibyte character set... which was a rather prescient decision...
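To make the O(1) versus O(n) point concrete, here is a minimal sketch in Python (chosen only for illustration; it is not Julia's implementation). The helper names `nth_code_point_utf8` and `nth_code_point_fixed` are hypothetical. The sketch assumes valid UTF-8 input and, for the 16-bit case, data with no surrogate pairs (i.e. UCS-2).

```python
def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return code point number n (0-based) from valid UTF-8 bytes.

    O(n): bytes must be scanned from the start, because each code point
    occupies between 1 and 4 bytes.
    """
    if n < 0:
        raise IndexError("negative index")
    seen = -1      # index of the most recent code point whose lead byte was seen
    start = None   # byte offset where code point n begins
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:          # not a continuation byte, so a new code point
            seen += 1
            if seen == n:
                start = i
            elif seen == n + 1:       # reached the next code point: n ends here
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


def nth_code_point_fixed(data: bytes, n: int, width: int) -> str:
    """Return code point number n from fixed-width storage.

    O(1): the byte offset is simply n * width.
    width 1 = Latin-1, 2 = UCS-2 little-endian, 4 = UTF-32 little-endian.
    """
    codec = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}[width]
    if not 0 <= n < len(data) // width:
        raise IndexError("code point index out of range")
    return data[n * width:(n + 1) * width].decode(codec)
```

With a fixed-width representation the offset is a multiplication; with UTF-8 the only general way to find code point n is to walk the lead bytes, unless the string is already known to be pure ASCII.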
> > > > or expand it to UTF-16 or UTF-32 by just widening bytes to
> > > > 16-bit or 32-bit words, which I can do *very* fast in x86-64
> > > > assembly, esp. with some of the newer instructions!)
> > >
> > > Why would you want to do this? Except for interop with foreign
> > > libraries? UTF-16 is the worst of all worlds as an encoding.
> >
> > UTF-16 that I know has no surrogate pairs... really just UCS2...
> > Depends on what you are doing with UTF-16... and yes, interop with
> > Java, JDBC, lots of databases... UTF-16 is very useful...
> > UTF-8 can blow things up to 3 bytes per character, potentially taking
> > 1.5 times the space of UTF-16... not good!
>
> I think you should really benchmark this kind of thing before choosing a
> legacy encoding such as Latin-1. Even for Asian scripts, typical content
> contains a lot of ASCII markup which makes UTF-8 actually more
> efficient:
> http://utf8everywhere.org/#faq.asians

You don't understand - I only use ANSI Latin 1 because it is a strict subset of Unicode... it's mainly a very fast way of saving 50% of your disk space, without any complicated conversions. I designed a system that used ANSI Latin 1 to store data if all of the characters were < 256, and UTF-16 otherwise... (along with a packed Unicode format that takes much less space than UTF-8, and less than even S-JIS for Japanese data sets...) I spent about 19-20 years benchmarking exactly this...

> Also, if you can afford checking that your UTF-16 has no surrogate
> pairs, you can also afford checking whether it's plain ASCII stored in a
> UTF8String or not.

The point is that if the language already *knows* that it is just plain ASCII, I can take advantage of that, with no O(n) checking required.
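The storage rule described above (Latin-1 when every character is < 256, UTF-16 otherwise) and the byte-widening step can be sketched as follows. This is a Python illustration under stated assumptions, not the system or the assembly routine mentioned in the message; `pack_narrowest` and `widen_latin1_to_utf16` are hypothetical names, and little-endian UTF-16 is assumed.

```python
from typing import Tuple


def pack_narrowest(text: str) -> Tuple[str, bytes]:
    """Store as Latin-1 if every code point is < 256, otherwise as UTF-16 (LE).

    The scan below is the O(n) check that a string type already known
    to be ASCII would let the caller skip.
    """
    if all(ord(ch) < 256 for ch in text):
        return "latin-1", text.encode("latin-1")
    return "utf-16-le", text.encode("utf-16-le")


def widen_latin1_to_utf16(data: bytes) -> bytes:
    """Widen each Latin-1 byte to a 16-bit little-endian word.

    Latin-1 code points equal the first 256 Unicode code points, so
    widening only inserts a zero high byte after every input byte.
    """
    out = bytearray(2 * len(data))
    out[0::2] = data   # low bytes carry the values; high bytes stay zero
    return bytes(out)


# Usage: compare storage sizes for a Western European and a Japanese string.
for sample in ("caf\u00e9", "\u65e5\u672c\u8a9e"):
    codec, packed = pack_narrowest(sample)
    print(codec, len(packed), "bytes; UTF-8 would take", len(sample.encode("utf-8")))
```

Because no table lookup or branching per character is needed, the widening step is the kind of operation that vectorizes well, which is the property the message relies on.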
A large number of the sources will be from databases where ANSI Latin 1 is the default character set, and UTF-8 or UTF-16 is only used if it is known that Unicode is needed... If the source gives me UTF-16 or UTF-32, then I will figure out how it can be most efficiently stored (never UTF-32, of course; UTF-16 always takes less space, usually quite a bit, unless you have a record full of emojis! ;-) (and even then, it would just take the same amount of space as the UTF-32 representation, or UTF-8, as long as the UTF-8 encoder works correctly and represents them in 4 bytes instead of encoding the surrogate pair as two 3-byte sequences... which at least used to be a common problem). UTF-8 can easily blow up Greek, Hebrew, Russian, Arabic (and many other languages) text to take twice as much space, and Asian text to 50% more; it all really depends on what's in the records... (I also worked for a number of years dealing with support for unstructured data... so I understand pretty well the frequencies of 1, 2, 3, or 4 byte UTF-8 encodings of Unicode characters in things like books, magazine articles, doctor's notes...)

> > > > I do also think it would be nice to have 8-bit (ANSI Latin 1 or
> > > > binary, not UTF-8), 16-bit (UCS2)
> > >
> > > Again, for interop with legacy files? I can see no other reason to
> > > use Latin-1 (or Windows 1252) or UCS2 these days.
> >
> > Latin-1 is a strict subset of Unicode (I'd never use anything other
> > than Unicode) - but both ASCII and ANSI Latin-1 are just subsets,
> > and as such are rather useful to save space when most of the time
> > you are just dealing with text from Western Europe, the Americas,
> > Australia & New Zealand...
>
> You don't save any space by storing ASCII text in Latin-1 instead of
> UTF-8...
>
> > It's all about performance, both when doing string handling, and when
> > saving/reading something to a database (or sending it over a wire)...
>
> I think you really should look at concrete examples and do some
> benchmarking.
> I doubt it will make a difference in most typical uses.

Again, I did spend quite a lot of time looking at benchmark results of exactly that, over a period of almost 20 years...

> That said, I'm not opposed to keeping ASCIIString somewhere (in a
> package?), as long as it's clear it's only intended for very specific
> cases.

That really makes it not that useful for me... I'll just have to do the O(n) checking/conversions of the UTF-8 strings from Julia, and to keep up performance, I'll just have separate methods that take Vectors of UInt8, UInt16, and UInt32 to handle whatever I get from the database.

> Regards

Regards as well
