> Chinese and Japanese (not Korean) don't use whitespace between "words".
>
> Ooh, that makes me curious: is there a good discussion of how to
> line-break Japanese text? I wonder how browsers are doing it...

As far as line breaking is concerned, it's not hard to do it right for Japanese
text. All browsers need to do is NOT to break where line breaking is
prohibited as specified in JIS X 14xxx(?)[1] and to break at other places
(syllable boundaries, character boundaries[2]) to make text as justified (on both sides)
as possible. The same is true of Korean and Chinese. It doesn't make any
difference whether spaces are used or not in Japanese/Korean/Chinese.
Mozilla (and I guess MS IE as well) supports JIS X 14xxx for Japanese,
Korean and Chinese.[3] A harder case than this is Thai text, and that's
where you need to pay more attention. Thai line breaking rules are also
supported by Mozilla.
As I wrote earlier, programs like 'fmt' should support this.
Netscape 3.x broke lines ONLY at spaces, so some Korean web page
authors used a simple perl script to insert a <wbr> tag everywhere (every
syllable boundary) line breaking is allowed.
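
The <wbr> workaround could look roughly like the sketch below. It is written in Python rather than perl and is not the script those authors actually used. The function names are made up for illustration. It assumes plain text with no HTML markup in it, precomposed (NFC) Hangul, and it ignores the handful of positions where breaking is prohibited.

```python
def is_hangul_syllable(ch):
    """True for precomposed Hangul syllables (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def insert_wbr(text):
    """Insert <wbr> between every pair of adjacent Hangul syllables.

    Hypothetical helper: it does not parse HTML, so it should only be
    fed text nodes, not whole documents.
    """
    out = []
    for i, ch in enumerate(text):
        if i > 0 and is_hangul_syllable(text[i - 1]) and is_hangul_syllable(ch):
            out.append("<wbr>")
        out.append(ch)
    return "".join(out)
```

Spaces are left alone, since a browser that breaks only at spaces already handles them.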
[1] The prohibition rule is not rocket science. You can easily guess it. Here are some examples:
- Lines cannot be broken after an opening quotation mark, single
or double. That is, a line cannot end with one.
- Lines cannot be broken before a comma, a period, a question mark, or an exclamation mark. That is, a line cannot begin with one.
- There are some Kana-specific rules I don't remember at the moment.
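
To make the two example rules concrete, here is a minimal Python sketch of a greedy line breaker. It is only an illustration under stated assumptions, not the JIS rule set. The character sets are a small hand-picked sample, and the names `NO_BREAK_AFTER`, `NO_BREAK_BEFORE`, `can_break` and `break_lines` are invented for this example. Width is counted in code points, not display columns or graphemes. The Kana-specific rules are left out.

```python
# Illustrative sample only: opening quotation marks may not END a line.
NO_BREAK_AFTER = set("“‘「『")
# Illustrative sample only: these punctuation marks may not BEGIN a line.
NO_BREAK_BEFORE = set(",.?!、。，．？！")


def can_break(text, i):
    """Return True if a break is allowed between text[i-1] and text[i]."""
    if i <= 0 or i >= len(text):
        return False
    return text[i - 1] not in NO_BREAK_AFTER and text[i] not in NO_BREAK_BEFORE


def break_lines(text, width):
    """Fill each line up to `width` code points, then back up to the
    nearest position where breaking is not prohibited."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Move the break point left until it is an allowed position.
        while end > start + 1 and not can_break(text, end):
            end -= 1
        if not can_break(text, end):
            end = start + width  # no allowed position found: break hard
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines
```

For example, `break_lines("「あ」い、う。え", 4)` returns `["「あ」", "い、う。", "え"]`: the first line is cut one character short so that the next line does not begin with the comma. A real implementation would then justify each line, which this sketch does not attempt.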
[2] To generalize, I'd use 'grapheme boundaries'. See Unicode TR #29 for details.
[3] See also Unicode TR #14. When you read UTR #14, be aware that its treatment of Korean line breaking is not satisfactory. Simply put, Korean text can be broken at any *grapheme boundary* (when NFC is used for modern text, that means at any Unicode code point boundary for modern syllables) as well as at spaces, except for about a dozen places where line breaking is prohibited (see the aforementioned JIS X 14xxx). 99% of Korean text in print, formal or informal, uses a layout justified on both sides, but TR #14 gives the *wrong* impression that about half of Korean text breaks lines only at spaces and uses a ragged style. The author of TR #14 wouldn't listen to my feedback (which he acknowledged at the end of TR #14), insisting that he's got plenty of printed materials contradicting what I had told him.
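
The remark about NFC can be checked with a few lines of Python. This is a small sketch using the standard `unicodedata` module. The helper name `nfc_syllables` is made up for illustration, and it only covers modern syllables that have a precomposed form.

```python
import unicodedata


def nfc_syllables(text):
    """Normalize to NFC and return one list item per code point.

    For modern Korean text, each item is then one precomposed syllable,
    so every boundary between items is a grapheme boundary.
    """
    return list(unicodedata.normalize("NFC", text))


# Two syllables written as conjoining jamo: 3 code points each.
decomposed = "\u1112\u1161\u11ab\u1100\u116e\u11a8"
syllables = nfc_syllables(decomposed)
```

Here `decomposed` has six code points, while `syllables` has two items, U+D55C and U+AD6D. Without normalization, a break at a code point boundary could fall in the middle of a syllable.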
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
