Re: [ilugd] Publishing UTF-8 encoded multilingual XHTML documents on web

आशीष शुक्ला "Wah Java !!" Fri, 28 Apr 2006 08:07:51 -0700

Mithun Bhattacharya wrote:
> 
> --- "आशीष शुक्ला \"Wah Java !!\""
> <[EMAIL PROTECTED]> wrote:
> 
>> Hi Gora G,
>>
> 
>> First of all, sorry, the ISO-8859-1'ed doc's URL is:
>> http://unixclan.no-ip.org/~21287/index.html
>>
>> And now, BOM'd UTF-8 document's URL is:
>> http://unixclan.no-ip.org/~21287/index-bom.html
>>
>> Well, your suggestion works in Konqueror 3.5.2, which I did not
>> expect. Since the current encoding is ISO-8859-1, Konqueror should
>> interpret the BOM bytes in that encoding and so ignore them as a BOM;
>> instead it uses the BOM to set the encoding, which is not acceptable
>> according to the HTML specification. It does not work in Mozilla
>> Firefox 1.5, which displays the BOM characters as they are.
>>
>> I think this is a problem with the HTML specification, which says
>> that the HTTP header emitted by the server should be given priority
>> in deciding the content type. But in my view, only a document knows
>> what encoding it is in, therefore the document's own declaration
>> should be given priority.
> 
> I am not sure which HTML specification you are looking at, but the W3
> page says quite the opposite of what you are claiming


I'm also looking at the same HTML v. 4.01 specification.

> http://www.w3.org/TR/html4/charset.html

The above URL also says this:

-- begin quote --
To sum up, conforming user agents must observe the following priorities when 
determining a document's character encoding (from highest priority to lowest):
1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set 
for "charset".
3. The charset attribute set on an element that designates an external resource.

In addition to this list of priorities, the user agent may use heuristics and 
user settings. For example, many user agents use a heuristic to distinguish the 
various encodings used for Japanese text. Also, user agents typically have a 
user-definable, local default character encoding which they apply in the 
absence of other indicators.

User agents may provide a mechanism that allows users to override incorrect 
"charset" information. However, if a user agent offers such a mechanism, it 
should only offer it for browsing and not for editing, to avoid the creation of 
Web pages marked with an incorrect "charset" parameter.
-- end quote --
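
To make the priority list above concrete, here is a minimal Python sketch of how a user agent might apply it. The function names (charset_from_content_type, charset_from_meta, resolve_charset) and the ISO-8859-1 fallback are my own assumptions for illustration; real browsers add heuristics and user settings on top, and the META scan here is a crude regular expression, not an HTML parser.

```python
import re


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(body):
    """Look for a META http-equiv="Content-Type" declaration in raw bytes."""
    # Decode as ISO-8859-1 so that every byte maps to some character;
    # the META declaration itself is plain ASCII.
    text = body[:1024].decode("iso-8859-1")
    for tag in re.findall(r"<meta\b[^>]*>", text, re.IGNORECASE):
        if re.search(r'http-equiv\s*=\s*["\']?content-type', tag, re.IGNORECASE):
            content = re.search(r'content\s*=\s*["\']([^"\']*)["\']', tag,
                                re.IGNORECASE)
            if content:
                return charset_from_content_type(content.group(1))
    return None


def resolve_charset(http_content_type, body, link_charset=None,
                    default="iso-8859-1"):
    """Apply the three priorities in order, then a local default (assumed)."""
    return (charset_from_content_type(http_content_type)          # priority 1
            or charset_from_meta(body)                            # priority 2
            or (link_charset.lower() if link_charset else None)   # priority 3
            or default)
```

With this ordering, a server that sends "text/html; charset=ISO-8859-1" wins over a META declaration inside the document, which is exactly the behaviour I am complaining about.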

> http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

Yeah, the HTTP/1.1 RFC. I already went through it when I was investigating
how language preferences in web browsers work.


> Basically a sample interaction between a browser and an HTTP server goes
> like this in terms of document encoding:
> 
> 1. Browser sends a request to the webserver with the Accept-Charset
> header, e.g. Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7. The charset
> could be a list, in which case the values are in decreasing order of
> priority. The q value mentions the allowed degradation in quality of
> the content if selecting the specific charset, in this case utf-8 or any
> charset other than ISO-8859-1.
> 2. Server responds with the charset as part of the content-type header,
> e.g. content-type:text/html; charset=ISO-LATIN-7. If none of the
> acceptable charsets mentioned by the browser is available at the server
> side then a 406 response is sent.
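
The two steps above can be sketched in Python roughly as follows. This is only an illustration under my own assumptions: parse_accept_charset and choose_charset are hypothetical names, not part of any server, and as far as I read RFC 2616 section 14.2 the preference comes from the q value rather than from the position in the list. The sketch also ignores the RFC's special case that ISO-8859-1 is implicitly acceptable when neither it nor "*" is listed.

```python
def parse_accept_charset(header):
    """Return (charset, q) pairs sorted by descending q; q defaults to 1.0."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        name = parts[0].lower()
        if not name:
            continue
        q = 1.0
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((name, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_charset(header, available):
    """Pick a charset the client accepts, or None (the 406 case)."""
    prefs = parse_accept_charset(header)
    available_lower = [name.lower() for name in available]
    explicit = {name for name, _ in prefs}
    for name, q in prefs:
        if q <= 0:
            continue
        if name == "*":
            # "*" stands for any charset not named elsewhere in the header
            for candidate in available_lower:
                if candidate not in explicit:
                    return candidate
        elif name in available_lower:
            return name
    return None
```

For the header in the example, a server holding only a UTF-8 copy would answer with utf-8, and a server holding both copies would answer with iso-8859-1.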

This kind of interaction is great, but it is not the only kind of
interaction we have. It works when you have a document in multiple
encodings and respond according to the user agent's preferences. There
also has to be some way to tell the webserver that document.html,
document.utf8.html and document.iso-8859-1.html are the same document in
different encodings. But my point is (explained with an example):

-- begin example --
A document in Hindi is placed in the English section, so a reader
searching the English section who comes across this document faces the
same problem. It would be better if the Hindi document had something on
its cover page which says "This is a Hindi language document", and if
the reader looked for this sentence on the cover page before starting to
read any book. But what actually happens is that the reader is told
strictly that any document found in the English section is in English,
and only if it is found in an unnamed section should you look for the
sentence on its cover page.
-- end example --

But here, in our case, the HTML spec clearly says that the HTTP header has
the highest priority.
> 
> The majority of the problem starts now. The standards say that the
> content-type specified by the server is a recommendation or a guideline
> and not an overriding instruction. The browser is supposed to accept
> the data in good faith but is supposed to use its own judgement in
> handling the data. This is the reason why all browsers give you an
> option to change the charset being used to render the current page.

BTW, which standard says this, and where?

So, in other words, the browser should not trust the server.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec7.html#sec7.2.1
-- begin quote --
Any HTTP/1.1 message containing an entity-body SHOULD include a Content-Type 
header field defining the media type of that body. If and only if the media 
type is not given by a Content-Type field, the recipient MAY attempt to guess 
the media type via inspection of its content and/or the name extension(s) of 
the URI used to identify the resource. If the media type remains unknown, the 
recipient SHOULD treat it as type "application/octet-stream".
-- end quote --
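
As a rough Python sketch of the rule quoted above: the function name effective_media_type and the plain dict of headers are assumptions of mine, and the guessing step here looks only at the name extension of the URI (via the standard mimetypes module), not at the content.

```python
import mimetypes


def effective_media_type(headers, url):
    """Media type per RFC 2616 section 7.2.1: header first, then a guess."""
    # Header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}
    content_type = lowered.get("content-type")
    if content_type:
        # Guessing is allowed "if and only if" the header is missing,
        # so a header that is present is taken as-is.
        return content_type.split(";")[0].strip().lower()
    guessed, _ = mimetypes.guess_type(url)
    return guessed or "application/octet-stream"
```

So the recipient is only allowed to guess when the server is silent; whenever the header is there, it is the header that counts.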

> 
> Next problem is UTF-8 encoding itself. This was developed after UTF-16
> and UTF-32 came into the picture primarily because it was backward
> compatible with ISO-8859-1. Do note most browsers and HTTP servers will
> default to ISO-8859-1 if a specific character set is not defined.

Yeah, that's the standard.

> Therefore the first 127 characters are exactly the same in UTF-8 and

Right again.

> ISO-8859-1. Any attempt at autodetecting character encoding will fail
> since there is no way to differentiate between a UTF-8 encoded
> character and two ISO-8859-1 encoded characters. That's the reason why
> you see funny characters on your screen if there is a mismatch between
> the server response and the page encoding.

OK
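
The funny characters are easy to reproduce. A small Python sketch, assuming nothing beyond the standard codecs: the same bytes are valid in both encodings, they just mean different things once you go above the ASCII range. The helper decodes_as_utf8 is a name I made up; a strict decode like this is only a heuristic, since pure ASCII and some ISO-8859-1 byte sequences also pass it.

```python
name = "आशीष"

# Each Devanagari code point becomes three bytes in UTF-8.
utf8_bytes = name.encode("utf-8")

# Reading those bytes as ISO-8859-1 never fails, because every byte is
# a valid ISO-8859-1 character; the result is just the wrong text.
misread = utf8_bytes.decode("iso-8859-1")

# No information is lost, so the damage can be undone.
restored = misread.encode("iso-8859-1").decode("utf-8")
assert restored == name


def decodes_as_utf8(data):
    """Heuristic: True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Text that was really written in ISO-8859-1 with accented letters usually fails the strict decode, so the check is of some use, but it cannot prove what the author intended.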

> 
> There is a way around it too, as mentioned here:
> http://www.w3.org/TR/html4/charset.html#encodings Basically, if you
> follow the standards there is no scope for a default value for the
> document character encoding. You are supposed to specifically mention
> which character set to use to render the document inside the document itself.
> <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

This should work only when the "Content-Type" HTTP header is absent, or
carries no "charset" parameter.

> This should be at the beginning of the document and in English
> (ISO-8859-1 has some unfair advantage since it was the first encoding
> used on the web). In theory every document should have it and every
> renderer should strictly adhere to it - irrespective of what anything
> else might suggest - unless of course the user wants to override it. In
> practice we all default to ISO-8859-1 and follow the server side
> recommendation or specific document encoding if present, otherwise most
> of the internet wouldn't render at all.
> 
> Coming to BOM I refer to http://www.w3.org/TR/html4/charset.html - read
> the section "Notes on specific encodings" which seems to say BOM should
> be used only if UTF-16 data is present. Also it should be the first
> byte to be transmitted to the user-agent - I am not sure whether that
> implies it should be before the HTTP headers or after that.
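
On the BOM: it is part of the document's bytes, so it travels in the message body, after the HTTP headers. Here is a minimal Python sketch of what happens with my index-bom.html; split_utf8_bom is a hypothetical helper name, and the page content is made up for the example.

```python
import codecs


def split_utf8_bom(data):
    """Return (has_bom, rest) for bytes that may start with the UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# A made-up page body: the three BOM bytes EF BB BF, then the markup.
page = codecs.BOM_UTF8 + "<html>".encode("utf-8")

has_bom, rest = split_utf8_bom(page)

# A client that trusts "charset=ISO-8859-1" from the server decodes the
# BOM bytes as three ordinary characters at the top of the page.
as_latin1 = page.decode("iso-8859-1")

# Python's "utf-8-sig" codec strips the BOM while decoding.
as_utf8 = page.decode("utf-8-sig")
```

The three leading characters in as_latin1 are what Firefox 1.5 shows at the top of the page, while Konqueror behaves more like the utf-8-sig decode.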
> 
> I guess that is all I can think of at the moment. Hopefully it has been
> of some use.
> 

Oh definitely, it is useful.

> 
> Mithun
> 

Thanks,
Ashish Shukla
-- 
आशीष शुक्ला alias "Wah Java !!"
http://wahjava.blogspot.com/

The only key to optimal life is precision.

                                -- Ashish Shukla "Wah Java !!"
                       http://wahjava.blogspot.com/2006/03/useful-thought.html


