HTML 5 is the up-and-coming version of the HTML standard, which supports all sorts of new and exciting features. For those who don't know about it, here's some background:
Wikipedia article: http://en.wikipedia.org/wiki/HTML_5
Summary of major differences from HTML 4: http://www.w3.org/TR/html5-diff/
Full specification: http://dev.w3.org/html5/spec/Overview.html

It's clear at this point that HTML 5 will be the next version of HTML. It was obvious for a long time that XHTML was going nowhere, but now it's official: the W3C has announced that the XHTML 2 working group's charter will not be renewed when it expires at the end of 2009, leaving HTML 5 as the only variant of HTML under active development. (Source: <http://www.w3.org/2009/06/xhtml-faq.html>)

MediaWiki will have to switch to HTML 5 sooner or later. It's a great standard, and I think we would do well to be early on the curve here and help spark interest in and support for it.

HTML 5 is designed to be backward-compatible with legacy content, both on the authoring side and (especially) the implementation side. Well-written XHTML 1.0 should theoretically need only minor modifications to validate as HTML 5, and indeed this appears to be the case in practice. All that's required to get a typical page in Monobook validating as HTML 5 in the W3C's experimental validator is (*if* we disregard user-added markup):

* Change the doctype to "<!doctype html>".
* Delete '<meta http-equiv="Content-Style-Type" content="text/css" />', which is a really stupid element anyway. :P
* Delete name attributes from all <a> elements. They've been redundant to id for eternity, and every browser in the universe supports id; we can finally move the ids to the headers themselves.
* Remove comments from inside <script> tags with a src attribute. I already did this in r52828, since they're pointless anyway.

(The W3C validator is at http://validator.w3.org/. You can override the doctype and set it to interpret Wikipedia URLs as HTML 5 under "More Options".)

Note that HTML 5 does follow in the "strict" vein of XHTML. Presentational elements and attributes such as font, border, cellpadding, etc. are all invalid in HTML 5. (Implementations must support them, but conforming documents must not use them.
<b> and <i> remain valid.) There's very little of this stuff left in the HTML that ships with the software. We can remove what remains incrementally as it's reported.

For user-added content, I think it's fair to just treat it as GIGO -- if users submit invalid content that can't be easily converted to a valid form, it will be output as-is. Users can already submit invalid content in cases where we can't easily fix it, e.g., duplicate ids. If we switch to HTML 5, the W3C validator will begin outputting errors on this presentational stuff, which should hopefully encourage users to reduce it over time, at least in high-profile places like the front page or infoboxes.

So converting to HTML 5 would be trivial. However, in addition to lending our support to good standards, there are several modest practical benefits that would accrue from the switch. I include here only things that are possible in valid HTML 5 documents but would not validate as XHTML 1 (so excluding stuff like localStorage), and that are usable right now (so excluding stuff like <nav>, <input type=color>, etc.):

* HTML 5 permits omission of a lot of the cruft that XHTML requires. It permits leaving off ending tags in most cases where that's unambiguous, and leaving off some tags entirely that XHTML requires (such as <html>, <head>, and <body> if they have no attributes). The "/>" ending is no longer required. Superfluous attributes like type=text/javascript on <script> are no longer needed (unless you want to use <script type=application/x-python> or something, of course!). Quotes may be omitted from attributes in almost all cases. The doctype is shorter and easy to remember, and there is no xmlns attribute. For an example of how compact valid HTML 5 can be, look at the source of http://aryeh.name/. I once did a crude test and found we could cut 5% or so off the length of our HTML by doing this -- *after* gzipping. Not only would this make our output smaller, it would also make it easier to read.
* We could support <video>/<audio> on conformant user agents without the use of JavaScript. There's no reason we should need JS for Firefox 3.5, Chrome 3, etc.
* We can use data-* attributes to store custom data for scripts. This came up in the case of the HTML diff work: the author of that stuck some data for scripts in custom attributes, which caused XHTML 1 validation to fail.
* We can use HTML 5 form attributes. These will enhance the experience of users of appropriate browsers, and do nothing for others. At least Opera 9.6x already supports almost all HTML 5 form attributes. (Source: <http://www.opera.com/docs/specs/presto211/forms/>) We could, for instance, give required fields the "required" attribute, which will cause the browser to prevent the form submission and notify the user if they aren't filled in, without needing either JavaScript or a server-side check. The "pattern" attribute even allows requiring that the input match a regex, and this is also supported by Opera 9.6x. See <http://dev.w3.org/html5/spec/Overview.html#common-input-element-attributes>.
* There are a couple of parser tests that currently fail because of misnested tags. If we altered the parser to no longer output any </p> tags (which HTML 5 permits), these tests would immediately pass. It doesn't look like anyone's going to fix them otherwise.

These are only a few of the things that have immediate concrete benefit. There are probably more I couldn't find immediately (HTML 5 is a huge spec), and of course in the long term there's an incredible amount that would be invaluable to us.

I propose the following migration plan:

1) Fix the doctype, Content-Style-Type, and name attributes. We can then officially claim we're shipping HTML 5! :) (Albeit maybe invalid in some cases.) Also remove any unnecessary attributes and elements, without breaking XML well-formedness. Begin using HTML 5 form attributes and any other useful features.
Poke the Cortado people about letting <video> work without JavaScript.

2) Once this goes live, if no problems arise, try deliberately introducing an XML well-formedness error. For instance, remove the quote marks around one attribute of an element that's included in every page. I suggest this as a separate step because I suspect there are some bot operators who are doing screen-scraping using XML libraries, so it would be a good idea to assess how feasible it is at the present time to stop being well-formed. In the long run, of course, those bot operators should switch to using the API. If we receive enough complaints once this goes live, we can revert it and continue to ship HTML 5 that's also well-formed XML, for the time being.

3) If XML well-formedness is not a problem, get rid of all unneeded closing tags, quotation marks, self-closing "/>" constructs, etc. Create an Html class like Xml, which will generate elements in the nice compact form that HTML 5 permits, and phase out use of Xml in favor of Html. (Xml has long since ceased to be purely about XML anyway.)

So, what are people's thoughts?
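
To give a feel for how small the markup changes in step 1 are, here is a rough sketch of the doctype, Content-Style-Type, and name-attribute fixes applied to an already-rendered page. It is an illustration only, written in Python rather than MediaWiki's PHP: the function name to_html5_step1, the SAMPLE_PAGE string, and the regex-based approach are all made up for this example, and the real change would be made where the skin generates the markup rather than by post-processing output.

```python
import re

HTML5_DOCTYPE = "<!doctype html>"

# Matches any doctype declaration, e.g. the XHTML 1.0 Transitional one.
_DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)

# Matches the Content-Style-Type <meta> element plus trailing whitespace.
_STYLE_META_RE = re.compile(
    r"""<meta\b[^>]*http-equiv=["']Content-Style-Type["'][^>]*>\s*""",
    re.IGNORECASE,
)

# Matches an opening <a ...> tag so its attributes can be rewritten.
_A_TAG_RE = re.compile(r"<a\b[^>]*>", re.IGNORECASE)

# Matches a quoted name attribute together with its leading whitespace.
_NAME_ATTR_RE = re.compile(r"""\s+name=("[^"]*"|'[^']*')""", re.IGNORECASE)


def to_html5_step1(page: str) -> str:
    """Apply the three step 1 markup changes to one page (hypothetical helper)."""
    # 1. Swap the existing doctype for the short HTML 5 one.
    page = _DOCTYPE_RE.sub(HTML5_DOCTYPE, page, count=1)
    # 2. Delete the Content-Style-Type <meta> element.
    page = _STYLE_META_RE.sub("", page)
    # 3. Drop name attributes, but only inside <a> tags; other elements
    #    (such as <input name=...>) must keep theirs.
    page = _A_TAG_RE.sub(lambda m: _NAME_ATTR_RE.sub("", m.group(0)), page)
    return page


# A made-up page fragment in the general style of XHTML 1.0 output.
SAMPLE_PAGE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>\n'
    '<meta http-equiv="Content-Style-Type" content="text/css" />\n'
    "<title>Test</title></head>\n"
    '<body><a name="History" id="History"></a><h2>History</h2>\n'
    '<input name="search" /></body></html>'
)

if __name__ == "__main__":
    print(to_html5_step1(SAMPLE_PAGE))
```

The result is still well-formed XML, which matches the intent of step 1: nothing here should break a bot that parses pages with an XML library. Moving the ids onto the headers themselves and stripping comments from <script src> tags are left out of the sketch.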