On Mon, 20 Aug 2012 00:44:22 -0400, Roy Smith wrote:

> In article <5031bb2f$0$29972$c3e8da3$54964...@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>
>> > So it may be with utf-8 someday.
>>
>> Only if you believe that people's ability to generate data will remain
>> lower than people's ability to install more storage.
>
> We're not talking *data*, we're talking *text*.  Most of those
> whatever-bytes people are generating are images, video, and music.  Text
> is a pittance compared to those.

Paul Rubin already told you about his experience using OCR to generate
multiple terabytes of text, and how he would not be happy if that was
stored in UCS-4.

HTML is text. XML is text. SVG is text. Source code is text. Email is
text. (Well, it's actually bytes, but it looks like ASCII text.) Log
files are text, and they can fill a hard drive pretty quickly.

Lots of data is text.

Pittance or not, I do not believe that people will widely abandon compact
storage formats like UTF-8 and Latin-1 for UCS-4 any time soon. Given
that we're still trying to convince people to use UTF-8 over ASCII, I
reckon it will be at least 40 years before there's even a slim chance of
migrating from UTF-8 to UCS-4 in a widespread manner. In the IT world,
that's close enough to "never" -- we might not even be using Unicode in
2052.

In any case, time will tell who is right.


-- 
Steven
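
P.S. To put rough numbers on the storage argument, here is a minimal
sketch in Python. The helper name encoded_sizes and the sample strings
are made up for illustration, and "utf-32-le" is used as a stand-in for
UCS-4 because that codec variant does not prepend a byte order mark.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in UTF-8 and in UCS-4.

    "utf-32-le" stands in for UCS-4 here (an assumption made for this
    sketch); unlike plain "utf-32" it does not add a 4-byte BOM.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


# Hypothetical sample strings, chosen only to show the spread.
samples = {
    "ascii log line": "GET /index.html HTTP/1.1 200",
    "accented latin": "na\u00efve caf\u00e9",
    "greek": "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",
}

for label, text in samples.items():
    sizes = encoded_sizes(text)
    ratio = sizes["ucs-4"] / sizes["utf-8"]
    print(f"{label}: {len(text)} code points, {sizes}, ratio {ratio:.2f}")
```

For pure ASCII text such as logs, HTML markup or source code, UCS-4
takes four bytes per character against one for UTF-8, so the same data
needs four times the storage. The gap narrows for text outside ASCII,
since UTF-8 uses two to four bytes for those characters.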