OT - Re: [idn] URL encoding in html page

Mark Davis Fri, 22 Mar 2002 10:09:20 -0800

From my experience talking with customers in the field, the main reason that people are not serving up UTF-8 pages is not the bandwidth, it is the fact that there are still some browsers out in the field that do not yet handle it correctly. While they are dying off fairly quickly, it is not quite at the point where people are willing to write them off.

As far as size goes, it is worthwhile looking at some data samples. The following are from a page on the Unicode site that is translated into different languages, so it has essentially the same information on each page.

Size (bytes)	Page
8882	s-chinese.html
8946	t-chinese.html
9347	esperanto.html
9498	maltese.html
9739	icelandic.html
9833	czech.html
9944	welsh.html
10064	danish.html
10109	swedish.html
10127	polish.html
10219	interlingua.html
10221	italian.html
10297	spanish.html
10308	portuguese.html
10312	lithuanian.html
10329	german.html
10376	romanian.html
10401	korean.html
10506	french.html
10726	japanese.html
10953	hebrew.html
11192	arabic.html
13292	greek.html
13870	russian.html
13892	persian.html
14549	hindi.html
15337	georgian.html
15853	deseret.html

So the best case is a little over half the size of the worst case (8882 vs. 15853 bytes). Some of this is due to the encoding, and some is due to different languages just using different numbers of characters. However, when you look at web pages in general use, the amount of text (in bytes) is really swamped by graphics, JavaScript, HTML code, and so on. So fundamentally, even the variations above are not that important in practice.
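
One rough way to separate the encoding effect from the language effect is to encode the same string both ways and compare byte counts. The sketch below is only illustrative: the sample strings and the choice of legacy codecs (euc_kr, koi8_r, latin_1) are assumptions made for the example, not data from the pages listed above.

```python
# Compare the encoded size of the same text in UTF-8 and in a legacy charset.
# The sample strings and legacy codecs are illustrative assumptions.
SAMPLES = {
    "Korean": ("안녕하세요", "euc_kr"),
    "Russian": ("Привет", "koi8_r"),
    "English": ("Hello", "latin_1"),
}


def size_ratio(text: str, legacy_codec: str) -> float:
    """Return the UTF-8 byte length divided by the legacy byte length."""
    return len(text.encode("utf-8")) / len(text.encode(legacy_codec))


for language, (text, codec) in SAMPLES.items():
    # A ratio of 1.0 means UTF-8 costs nothing extra for this text.
    print(f"{language}: UTF-8 is {size_ratio(text, codec):.2f}x the legacy size")
```

These ratios cover raw text only. On a real page the ASCII markup, which is one byte per character in either encoding, pulls the overall ratio back toward 1.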

BTW This is getting way off topic.

Mark

—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----

From: "Soobok Lee" <[EMAIL PROTECTED]>

To: "Mark Davis" <[EMAIL PROTECTED]>; "IETF idn working group" <[EMAIL PROTECTED]>

Sent: Friday, March 22, 2002 08:16

Subject: Re: [idn] URL encoding in html page

>
> ----- Original Message -----
> From: "Mark Davis" <[EMAIL PROTECTED]>
> To: "Soobok Lee" <[EMAIL PROTECTED]>; "IETF idn working group" <[EMAIL PROTECTED]>
> Sent: Saturday, March 23, 2002 12:18 AM
> Subject: Re: [idn] URL encoding in html page
>
>
> > Compliant browsers already have to handle Unicode, since NCRs (e.g.
> > &#x1234; ) are always Unicode code points. All XML parsers also have
> > to handle Unicode (UTF-8 and UTF-16).
>
> Right, already.
> MS IE and Netscape have supported Unicode for several years,
> but most homepages are still in legacy encodings.
> MS Word (already Unicode-based) has a feature to produce legacy-encoded
> .html files from Unicode-based .doc files for web publishing.
>
> Korean/Japanese/Chinese text in UTF-8 is 50% bigger than in legacy encodings,
> so 50% more disk space and bandwidth will be required.
> Each Cyrillic letter occupies one octet in a legacy code, while in UTF-8
> it requires 2 octets, so 100% more space is needed.
> I cannot imagine all Russians making the transition to UTF-8.
> Legacy encodings are more space-efficient than Unicode.
>
> Legacy-to-legacy conversions like BIG5->KSX1001 are really implemented
> as two steps: BIG5->Unicode and Unicode->KSX1001. Unicode
> is actively used as such an intermediate encoding, but it is still not used or entered
> directly by end users nearly as much. Rather, Unicode may be a hub that facilitates the interchange
> of information between different legacy encodings, or font sharing for characters in different legacy encodings.
>
> I regard Unicode as a substrate (not as a competitor) upon which legacy encodings are built.
>
> >
> > > Legacy encodings
> > > will dominate even in the future, because they are compact and
> > > inexpensive.
> >
> > While I do expect the transition to Unicode to take some time, once
> > some of the older browsers die off it may shift more rapidly than we
> > think.
>
> I am neither a Unicode expert nor a character-set expert. But every day I feel
> the strong inertia toward legacy encodings in our local language communities.
> Language-tagging-enabled text formats like HTML will greatly lengthen the lifespan
> of legacy encodings and allow legacy-encoded HTML text
> to be interchanged internationally without problems.
>
> Soobok Lee
>
> >
> > Mark
> > —————
> >
> > Γνῶθι σαυτόν — Θαλῆς
> > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
> >
> > http://www.macchiato.com
> >
> > ----- Original Message -----
> > From: "Soobok Lee" <[EMAIL PROTECTED]>
> > To: "IETF idn working group" <[EMAIL PROTECTED]>
> > Sent: Friday, March 22, 2002 02:04
> > Subject: Re: [idn] URL encoding in html page
> >
> >
> > >
> > > ----- Original Message -----
> > > From: "Bruce Thomson" <[EMAIL PROTECTED]>
> > > To: "Soobok Lee" <[EMAIL PROTECTED]>; "IETF idn working group" <[EMAIL PROTECTED]>
> > > Sent: Friday, March 22, 2002 6:29 PM
> > > Subject: Re: [idn] URL encoding in html page
> > >
> > >
> > > > > What if all the viewable HTML text is in English, but only the href URL contains
> > > > > legacy (Korean) encoded hostnames? Chinese visitors would see a clean English homepage,
> > > > > but fail to click through the Korean link.
> > > > >
> > > > Well, that could happen, but a META tag would solve that so easily. Personally
> > > > I often use a simple text editor to deal with HTML, and would find it easier to
> > > > use legacy encodings or UTF-8 than cut-and-paste ACE from somewhere.
> > > > Of course the user could do it either way and it would work.
> > >
> > > Yes. Charset META tags help. But many homepages make assumptions about the main audience's
> > > default character encoding and very often omit the META tag for the encoding, like:
> > > <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
> > >
> > > Moreover, an IDN URL could be used in a pure FRAMESET document that defines frame URLs
> > > and contains no viewable text. Such FRAMESET documents often omit charset META tags.
> > > (Look at the HTML source of http://www.freeway.co.kr/ )
> > >
> > > AFAIK, 99.99999% of Korean homepages have an implicit or explicit
> > > legacy Korean encoding (KS_C_5601-1987 or euc-kr). So do most Japanese/Chinese homepages.
> > > UTF-8/UCS-2 encodings are rarely used in global web publishing. Legacy encodings
> > > will dominate even in the future, because they are compact and inexpensive.
> > >
> > > If we want to make IDN truly internationally interoperable, all IDN-aware web browsers/applications
> > > should contain libraries of all kinds of legacy-to-Unicode conversion routines. That will place
> > > too much of a memory burden on handheld devices like PDAs.
> > >
> > > Moreover, legacy encodings are revised separately from Unicode. We may face versioning problems
> > > as tough as the stringprep/nameprep versioning problems for newly added Unicode code points.
> > > How can we guarantee the stability and integrity of IDN operations across all
> > > combinations of the numerous kinds and versions of IDN-aware
> > > applications and legacy encodings?
> > >
> > > Soobok Lee
> > >
> > >
> > >
>
>
>
