Re: [HTML5] 2.8 Character encodings

Anne van Kesteren Tue, 04 Aug 2009 04:38:08 -0700

On Tue, 04 Aug 2009 12:17:49 +0200, Dr. Olaf Hoffmann <[email protected]> wrote:

Well, this is a problem for CSS too, because some properties are
defined differently in CSS2.1 than in CSS2.


Things change. This is not necessarily a problem in my experience.

I discovered this some time ago for example for clipping for
some SVG test documents, which appeared wrong in Opera.
SVG depends on CSS2, therefore these tests are still well
defined, applied to (X)HTML they are not testable anymore,
because CSS has no version indication.

This is one way of looking at the problem. Another way of looking at the problem is that specifications cannot incrementally evolve like implementations do and are therefore not always accurate in what they say. Just like normal languages, Web languages change now and then.

For 'HTML5' - as long as I cannot simply write version="HTML5"
I cannot start to write HTML5 documents.


In effect your documents will be treated as HTML5 regardless.

Already this is a 'show stopper' for 'HTML5' currently. One can still discuss the
current draft, but for formal reasons one cannot write a
'HTML5' document ;o)

As far as I know there are no formal reasons why one cannot write HTML5 documents and publish them, and in fact many people are authoring HTML5 documents and publishing them.

There is no problem to write HTML3.2, HTML4, XHTML1.0,
XHTML1.1 or XHTML+RDFa, even if for some of them the
version indication is not very elegant and not very relevant
for typical user agents.

This assumes versioning is necessary. Experience with Web browsers shows that it is not needed, and doing without it avoids a lot of complexity.

No, you can just specify it. Just like you can in HTML4.


I can write the string, but indeed, if I do it, it means 'Windows-1252'.


Not for your authoring tool or a conformance checker.

Therefore effectively, I cannot indicate, that something is
'ISO-8859-1' and not 'Windows-1252'.

You cannot indicate that something needs to be decoded as ISO-8859-1 by Web browsers for text/html content. This has been the case for a long time and is nothing new.
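
To make the difference concrete, here is a minimal Python sketch. It assumes Python's built-in codecs "latin-1" and "cp1252" as stand-ins for ISO-8859-1 and Windows-1252, and the sample bytes are made up for illustration; it is not how any browser implements decoding. The two encodings only disagree for bytes in the range 0x80 to 0x9F.

```python
# Hypothetical sample: bytes 0x93, 0x94 and 0x80 fall in the range
# where ISO-8859-1 and Windows-1252 disagree.
raw = b"\x93quoted\x94 \x80 5"

# Strict ISO-8859-1: every byte maps to the code point of the same
# value, so 0x80-0x9F become invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

# Windows-1252: the same bytes become curly quotes and the euro sign.
as_cp1252 = raw.decode("cp1252")

# Outside 0x80-0x9F the two encodings agree.
same_either_way = b"caf\xe9".decode("latin-1") == b"caf\xe9".decode("cp1252")
```

So a text/html document labelled ISO-8859-1 that contains such bytes displays with the Windows-1252 reading in browsers.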

Therefore if I start to write some test documents and this problem is
not avoided and a version indication is possible, I think, I will use
UTF-8 for those documents.


This seems like a good idea regardless.


Sure, if you have no history with thousands of documents or scripts.

More and more programming languages work with Unicode internally. What scripts would act up?

Typically this means, that they are
incompatible with other of my documents and scripts and will appear
in another directory with an Apache-.htaccess file indicating the
different encoding.


That is one solution. You could also always indicate the encoding in the
document instead and instruct Apache to not include the charset parameter.


Of course, the document should contain it too. However on many servers
authors have no direct control over the Apache defaults. Therefore it is
always a good idea to ensure that this works independently of gags
of the administrator.

That is not what I'm saying. What I'm saying is that you could instruct Apache to not include the charset parameter, so you do not have to maintain a document/charset mapping within Apache. Just within the documents.
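
As a rough Python sketch of what relying on the in-document declaration means: when the HTTP header carries no charset parameter, a consumer has to find the encoding in the document itself. The function name `declared_charset`, the regular expression, and the `scan_limit` default are all assumptions made for this illustration; this is a simplification, not the prescan algorithm from the HTML5 draft.

```python
import re

# Simplified pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def declared_charset(document: bytes, default=None, scan_limit=1024):
    """Return the lowercased charset label declared near the start of
    the document, or `default` if none is found.

    scan_limit is an assumed value for this sketch: only the first
    bytes of the document are inspected.
    """
    match = META_CHARSET.search(document[:scan_limit])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


new_style = declared_charset(b'<!DOCTYPE html><meta charset="utf-8"><title>t</title>')
old_style = declared_charset(
    b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
)
missing = declared_charset(b"<p>no declaration here</p>")
```

With the declaration kept in each file, documents in different encodings can live in the same directory without any per-directory server configuration.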

If you simply switch to UTF-8 for all future work this will become less
and less of a problem. And then you've also covered other scripts should the need arise to use them.

For some projects, it may take several years, until I update them
completely. On one server I still found HTML3.2 documents this
year ;o)


Yeah, this is nothing new :-) Lots of legacy content out there.

More often content is just added or minor bugs are fixed.
I think, this is the same for many authors having already
thousands of documents around somewhere.

If you are just fixing minor bugs I do not see what HTML5 has to do with this.

Of course if you are 12 to 15 years old, starting your first
project, you could start just from the beginning with a currently
proper choice. However, when you start, you don't understand,
what a proper choice for the future is. Especially for them we
have to keep things simple to get some better quality of
documents in the future. Because 'HTML5' works off a lot
of historical relics and browser bugs, it is not a good
option for a simple start anyway.

HTML5 also removes a lot of historical relics. E.g. SGML. This simplifies things _a lot_ for authors, in my opinion and in my experience of talking with Web developers about this. (And from feedback I get from colleagues who talk to Web developers on a near full-time basis.)

--- off topic ;o) ---
Since HTML5 is no longer SGML-based, entity definitions there will not work and are non-conforming. The reason we did this was that, other than the validator, no software processed text/html resources in this way, leading to a lot of author confusion because of the clear mismatch between the validator and other software.

SVG tiny 1.2 documents have typically no doctype, but for this purpose,
it is still pretty useful and an SVG/XML-parser interprets this.
Because for 'HTML5' it is possible to use an XML-parser, it should
be possible to use this important feature too. Of course, because
it is XML, one can simply start to mix with elements from other
languages too, if this appears to be more convenient as to use
microdata to indicate 'HTML5' elements or their content to
represent the same meaning as those elements from other
namespaces.

Ah. I did not realize you were talking about XHTML5 (HTML5 expressed in XML). In XHTML5 ISO-8859-1 just means ISO-8859-1, not Windows-1252. That is just done for text/html documents. You can indeed use DOCTYPE features in XHTML5, although I believe it is currently recommended that you do not use them.



--
Anne van Kesteren
http://annevankesteren.nl/
