Re: [Haskell-cafe] Re: String vs ByteString

wren ng thornton Tue, 17 Aug 2010 18:25:25 -0700

Bulat Ziganshin wrote:
> Johan wrote:
>> So it's not clear to me that using UTF-16 makes the program
>> noticeably slower or use more memory on a real program.
>
> it's clear misunderstanding. of course, not every program holds much
> text data in memory. but some does, and here you will double memory
> usage

I write programs that hold onto quite a good deal of natural language text; a few million words at least. Getting efficient Unicode for that is a high priority. However, all of that text is in Japanese, Chinese, Arabic, Hindi, Urdu,... That's the reason I want Unicode. I'm pretty sure UTF-16 isn't going to be causing any special problems here.

For NLP work, any language with a vaguely ASCII format isn't a problem. We've been shoving English and western European languages into a subset of ASCII for years (heck, we don't even allow real parentheses!).

For the mostly English files on my harddrive, UTF-8 is a clear win. But when it comes to programming, I'm not so sure. I'd like to see some good benchmarks and a clear explanation of where the costs are. Relying on intuitions is notoriously bad for these kinds of encoding issues.
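
As a rough starting point for that kind of measurement, here is a minimal sketch that counts how many bytes a string would take in each encoding. It is not a benchmark: it says nothing about speed or about the overhead of any particular string type, and the names (utf8Bytes, utf16Bytes, encodedSizes) and the sample strings are made up for illustration. It uses only base.

```haskell
import Data.Char (ord)

-- Bytes needed to encode one code point in UTF-8.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Bytes needed to encode one code point in UTF-16
-- (2 inside the Basic Multilingual Plane, 4 for a surrogate pair).
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- (UTF-8 size, UTF-16 size) in bytes for a whole string.
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Bytes s), sum (map utf16Bytes s))

main :: IO ()
main = mapM_ report samples
  where
    -- Illustrative samples only; real corpora mix in ASCII markup,
    -- digits, and whitespace, which shifts the ratio.
    samples =
      [ ("English",  "Hello, world")
      , ("Japanese", "こんにちは世界")
      , ("Arabic",   "مرحبا بالعالم")
      , ("Hindi",    "नमस्ते दुनिया")
      ]
    report (name, s) =
      let (u8, u16) = encodedSizes s
      in putStrLn (name ++ ": UTF-8 " ++ show u8 ++ " bytes, UTF-16 " ++ show u16 ++ " bytes")
```

Per code point, ASCII costs 1 byte in UTF-8 and 2 in UTF-16; Arabic letters cost 2 in both; kana, CJK ideographs, and Devanagari cost 3 in UTF-8 and 2 in UTF-16. So which encoding is smaller depends on the script mix of the actual data, which is one more reason to measure rather than guess.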


--
Live well,
~wren
