On Fri, Jan 6, 2012 at 4:45 PM, Tab Atkins Jr. <[email protected]> wrote:
> Note that this may be subject to the same counter-intuitive forces
> that cause UTF-8 to usually be better for CJK HTML pages (because a
> lot of the source is ASCII markup). In JSON, all of the markup
> artifacts (braces, brackets, quotes, colon, commas, spaces) are ASCII,
> along with numbers, bools, and null. Only the contents of strings can
> be non-ascii.
>
> JSON is generally lighter on markup than XML-like languages, so the
> effect may not be as pronounced, but it shouldn't be dismissed without
> some study. At minimum, it will *reduce* the size difference between
> the two.
>

And more fundamentally, this is trying to repurpose charsets as a
compression mechanism. If you want compression, use compression
(Transfer-Encoding: gzip):

-rw-rw-r-- 1 glenn glenn 7274 Jan 06 23:59 test-utf8.txt
-rw-rw-r-- 1 glenn glenn 3672 Jan 06 23:59 test-utf8.txt.gz
-rw-rw-r-- 1 glenn glenn 6150 Jan 06 23:59 test-utf16.txt
-rw-rw-r-- 1 glenn glenn 3468 Jan 06 23:59 test-utf16.txt.gz

Even without compression, the difference (~15%) isn't enough to warrant
the complexity, and with compression the difference is under 10%.

(Test case is simply copying the rendered text from
http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
in Firefox.)

-- 
Glenn Maynard
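
For anyone who wants to repeat the comparison on their own text, here is a rough Python sketch of the measurement. It is not the procedure used for the numbers above. The helper names (encoded_sizes, percent_smaller) are made up for illustration, and it assumes UTF-16 little-endian with no BOM, which may not match how the test files were saved.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Return {encoding: (raw_bytes, gzipped_bytes)} for UTF-8 and UTF-16.

    Assumption: "utf-16-le" (no BOM) stands in for "UTF-16" here.
    """
    sizes = {}
    for name in ("utf-8", "utf-16-le"):
        raw = text.encode(name)
        sizes[name] = (len(raw), len(gzip.compress(raw)))
    return sizes


def percent_smaller(a: int, b: int) -> float:
    """How much smaller b is than a, as a percentage of a."""
    return 100.0 * (a - b) / a


if __name__ == "__main__":
    # Byte counts taken from the directory listing in the message.
    print(round(percent_smaller(7274, 6150), 1))  # uncompressed: UTF-8 vs UTF-16
    print(round(percent_smaller(3672, 3468), 1))  # gzipped: UTF-8 vs UTF-16
```

Calling encoded_sizes() on a string pasted from a mostly-CJK page and on a JSON document with the same strings should show how much the ASCII markup narrows the gap, which is the effect described in the quoted text.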
