New I18N position at W3C/Keio University

2004-12-22 Thread Martin Duerst
Dear friends, colleagues, everybody, W3C has opened a position in Internationalization at Keio University in Japan, because I'm leaving the W3C Team at the end of March. For details, please see http://www.w3.org/2004/12/i18nposition. For other open positions at W3C, please see

new mailing list: public-ietf-collation@w3.org

2004-08-15 Thread Martin Duerst
Dear Unicoders, Some of you may be interested in this: After discussion with Chris Newman, author of the Internet Draft http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-02.txt, we have created a new mailing list, [EMAIL PROTECTED], for discussion (and hopefully completion) of this

Character Model: Two new documents and Last Call

2004-02-25 Thread Martin Duerst
The Internationalization Working Group of the W3C is glad to announce the publication of two new documents: Character Model for the World Wide Web 1.0: Fundamentals (http://www.w3.org/TR/charmod, Last Call) and Character Model for the World Wide Web 1.0: Normalization

Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
At 23:34 03/12/07 +0900, Jungshik Shin wrote: On Sun, 7 Dec 2003, Peter Jacobi wrote: There is some mixup of lang and encoding tagging, which I didn't fully understand. When lang is not explicitly specified, Mozilla resorts to 'inferring' 'langGroup' ('script (group)' would have been a

Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
At 23:16 03/12/07 +0900, Jungshik Shin wrote: On Sun, 7 Dec 2003, Peter Jacobi wrote: So, I'm still wondering whether Unicode and HTML4 will consider <span style='color:#00f'>&#x0BB2;</span>&#x0BBE; valid and it is the task of the user agent to make the best out of it. I think this is valid. I

Re: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
Hello Peter, At 13:25 03/12/07 +0100, Peter Jacobi wrote: Dear Doug, All, BTW, your Unicode test page is marked: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> This is of course redundant as this is the HTTP default. Well, the HTTP spec unfortunately still says so, but

AddDefaultCharset considered harmful (was: Mojibake on my Web pages)

2003-09-25 Thread Martin Duerst
Hello Doug, others, Here is my most probable explanation: Adelphia recently upgraded to Apache 2.0. The core config file (httpd.conf) as distributed contains an entry AddDefaultCharset iso-8859-1 which does what you have described. They probably adopted this because the comment in the config

Re: Language Tag Registrations

2003-06-02 Thread Martin Duerst
Hello Marion, IANA won't ask your question. They are just the record keeper, they don't make any decisions. If you have a need for identifying a particular kind of language, then what you do is that you submit a registration proposal. Others will then comment on that proposal. If you don't have

Re: BOM's at Beginning of Web Pages?

2003-02-21 Thread Martin Duerst
At 13:14 03/02/18 -0800, Jonathan Coxhead wrote: That's a very long-winded way of writing it! How about this: #!/usr/bin/perl -pi~ -0777 # program to remove a leading UTF-8 BOM from a file # works both STDIN -> STDOUT and on the spot (with filename as argument)
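
The Perl script quoted above is cut off in this listing. As a minimal sketch of the same idea in Python (not the script from the thread; the function name strip_utf8_bom is hypothetical), a leading UTF-8 BOM is the three bytes EF BB BF and can be dropped like this:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper for illustration; only a BOM at the very start
    of the data is removed, later occurrences are left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When reading text rather than raw bytes, Python's "utf-8-sig" codec does the same thing on decoding.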

Re: [REPOST, LONG] XML and tags (LONG) - SCSU for XML

2003-02-21 Thread Martin Duerst
At 11:24 03/02/21 -0800, Markus Scherer wrote: Marco Cimarosti wrote: BTW, would it be possible to encode XML in SCSU? Yes. Any reasonable SCSU encoder will stay in the ASCII-compatible single-byte mode until it sees a character from beyond Latin-1. Thus the encoding declaration will be

RE: glyph selection for Unicode in browsers

2002-10-08 Thread Martin Duerst
At 13:41 02/10/02 +0900, Martin Duerst wrote: I'm not sure this is possible with Apache, maybe there is a need for a RemoveCharset directive similar to RemoveType (http://httpd.apache.org/docs/mod/mod_mime.html#removetype). Or maybe there is some other way to get the same result. If a new

Call for participation: I18N Activity WG Task Forces

2002-10-03 Thread Martin Duerst
Dear Unicoders, As announced at the International Unicode Conference in San Jose the W3C Internationalization Activity has recently been restructured, and the Internationalization Working Group (WG) and Interest Group (IG) have been re-chartered. We are sure that this will provide you with

RE: glyph selection for Unicode in browsers

2002-10-02 Thread Martin Duerst
At 12:14 02/10/01 -0400, [EMAIL PROTECTED] wrote: I agree that 'sniffing' and 'guessing' are ill-defined, and not to be relied upon. However, I find it a bit 'ill-defined' that there is no well-defined (web server independent) way for the 'users' to override the possibly wrong encoding default

RE: glyph selection for Unicode in browsers

2002-09-30 Thread Martin Duerst
At 07:37 02/09/26 +0900, [EMAIL PROTECTED] wrote: I would be happy if just this <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> would be enough to convince the browsers that the page is in UTF-8... It isn't if the HTTP server claims that the pages it serves are in ISO 8859-1. A
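
The precedence described in this snippet (a charset in the HTTP Content-Type header wins over a meta declaration in the page) can be sketched roughly as follows. This is an illustration only, not browser code; the function name effective_charset and its parameters are hypothetical.

```python
from email.message import Message


def effective_charset(content_type_header, meta_charset):
    """Pick the charset a user agent would likely use for an HTML page.

    Hypothetical helper: the charset parameter of the HTTP Content-Type
    header, if present, takes priority over the charset declared in a
    <meta http-equiv="Content-Type"> element inside the document.
    """
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        http_charset = msg.get_content_charset()  # lowercased, or None
        if http_charset:
            return http_charset
    return meta_charset.lower() if meta_charset else None
```

With a server that sends "text/html; charset=ISO-8859-1", the meta declaration of utf-8 is ignored, which is the situation the thread complains about.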

Re: browsers and unicode surrogates

2002-04-23 Thread Martin Duerst
At 22:25 02/04/19 +0100, Steffen Kamp wrote: However, when giving the validator an ASCII-only document with a META tag specifying UTF-16 as encoding (just for testing) it says that it does not yet support this encoding, so I don't fully trust the validator in this case. The validator indeed

Re: browsers and unicode surrogates

2002-04-23 Thread Martin Duerst
Just a very small correction: At 07:19 02/04/22 -0400, James H. Cloos Jr. wrote: There are other ways as well. Apache will already (if you use the default configs) add the Content-Language header if you use a filename like foo.en.html. You could have it also add the charset via a similar

New I-D for Internationalized Resource Identifiers

2002-04-17 Thread Martin Duerst
Dear Unicoders, I have just submitted draft-w3c-i18n-iri-00.txt to the Internet Drafts editor. This draft replaces draft-masinter-url-i18n-08.txt. It should be published in a few hours/days. In the meantime it is available at http://www.w3.org/International/2002/draft-w3c-i18n-iri-00.txt.

Re: xml 1.0 and unicode ideograph ext a and ext b

2002-04-06 Thread Martin Duerst
Hello Yung-Fong, First, please send potential error reports to [EMAIL PROTECTED] as indicated in the spec. Second, as somebody else has already said, the XML Core WG is working on extending the repertoire of XML Names in XML Blueberry / XML 1.1. If you have any specific comments, I suggest you

Re: All-kana documents

2002-03-05 Thread Martin Duerst
Character-based compression schemes have been suggested by others. But this is not necessary, you can take any generic data compression method (e.g. zip,...) and it will compress very efficiently. The big advantage of using generic data compression is that it's already available widely, and in

Rechartering the W3C I18N Activity

2002-03-05 Thread Martin Duerst
Dear Unicoders, W3C organized a workshop co-located with the 20th Unicode Conference last month in Washington DC, to discuss the future of the W3C Internationalization Activity. The minutes and results of the workshop are now published at http://www.w3.org/2002/02/01-i18n-workshop

barcodes (was: RE: Passing non-english character params in URL)

2002-02-05 Thread Martin Duerst
Hello Brant, This is not really a Web internationalization question. Therefore I'm forwarding it to the unicode mailing list. Regards, Martin. At 08:48 02/02/05 -0500, IDAutomation.com, Inc. wrote: I am hoping you can help me with a FileMaker task. We sell barcode fonts and we have several

Re: New plane 1 page for testing your browsers

2002-01-17 Thread Martin Duerst
At 21:44 02/01/06 -0800, James Kass wrote: Martin Duerst wrote, (I wrote,) It would be perfectly correct and might even allow the page to sport one of those valid-HTML gifs from W3. But it doesn't. Just tried changing the charset on an NCR Deseret test page from UTF-8 to US-ASCII. Both

Reminder: Jan 10: W3C I18N Workshop deadline

2002-01-07 Thread Martin Duerst
Dear Unicoders, The deadlines for registrations and submissions for the W3C Internationalization workshop are approaching rapidly; please make sure you don't miss them. Registration deadline: January 10th, 2002 (Thursday) (see

Re: New plane 1 page for testing your browsers

2002-01-06 Thread Martin Duerst
At 00:05 02/01/04 -0500, Tex Texin wrote: Thanks to James Kass, we have a new version of the Unicode examples for plane 1, that uses UTF-8, instead of NCRs. So the following link is to the original page that is code page x-user-defined and uses NCRs for supplementary characters:

Ruby (was: Re: Vertical scripts)

2001-12-26 Thread Martin Duerst
At 17:30 01/12/25 -0800, Michael (michka) Kaplan wrote: From: "$BAk]namdqor(B $BDialamt_dgr"(B [EMAIL PROTECTED] By the way, does any browser in common use support the Ruby extensions to HTML? The 'ruby extensions for HTML' are defined in http://www.w3.org/TR/ruby/, a W3C recommendation.

Re: Rush request for help!

2001-12-23 Thread Martin Duerst
I agree with Jungshik that U+76F4 (straight) is possibly the case where unification went farthest in the sense that it's the case where average modern readers in various areas might be most (1) confused if they see the glyph variant they are not used to. (1) 'most confused' should not be

Re: HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread Martin Duerst
Hello James (and everybody else), Can you please send comments and bug reports on the validator to [EMAIL PROTECTED]? Sending bug reports to the right address seriously increases the chance that they get fixed. Regards, Martin. At 14:46 01/12/16 -0800, James Kass wrote: Elliotte Rusty Harold

Re: Clean and Unicode compliance

2001-12-16 Thread Martin Duerst
At 07:16 01/12/14 -0800, James Kass wrote: Having an HTML validator, like Tidy.exe, which generates errors or warnings every time it encounters a UTF-8 sequence is unnerving. It's especially irritating when the validator automatically converts each string making a single UTF-8 character into two

Re: Clean and Unicode compliance

2001-12-16 Thread Martin Duerst
As the person who implemented UTF-8 checking for http://validator.w3.org, I beg to disagree. In order to validate correctly, the validator has to make sure it correctly interprets the incoming byte sequence as a sequence of characters. For this, it has to know the character encoding. As an
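
As a rough illustration of what checking a byte sequence for UTF-8 well-formedness means, here is a minimal Python sketch. It is not the validator's implementation; the function name is_valid_utf8 is hypothetical, and it simply relies on Python's strict UTF-8 decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence.

    Hypothetical helper: Python's strict decoder rejects truncated
    sequences, overlong forms, and encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

A checker like this can only confirm that the bytes are consistent with UTF-8; it still needs to be told (or to assume) that UTF-8 is the encoding to test against.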

W3C Internationalization Workshop

2001-12-10 Thread Martin Duerst
Dear Unicoders, W3C is holding a workshop on Internationalization to evaluate the work over the last years and decide on new directions (in particular guidelines and outreach). Details are as follows: Date: 1 February 2002 Location: Omni Shoreham Hotel, Washington DC, USA The Call for

Comments on draft-masinter-url-i18n-08.txt, please

2001-12-09 Thread Martin Duerst
Dear Unicoders, http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt about the internationalization of URIs (called IRIs) has recently been updated and published. This has been around for a long time, but we plan to move ahead with it in the very near future. Please have a look at

Re: Unicode aware drawing program

2001-12-06 Thread Martin Duerst
I suggest you look at tools that in one way or another produce SVG. SVG is based on XML and therefore supports Unicode. Please see http://www.w3.org/Graphics/SVG/ and http://www.w3.org/Graphics/SVG/SVG-Implementations.htm8#svgedit and below. Please note that not all tools may support the same

RE: Planning a Unicode Only Week

2001-11-29 Thread Martin Duerst
It's very much working that way in any serious browsers. Some font formats (e.g. bitmaps for XWindows on Unix) use layouts corresponding to traditional encodings. Truetype fonts used on many systems can be directly accessed by Unicode, but part of the info in a conversion table is still needed to

RE: japanese xml

2001-08-31 Thread Martin Duerst
At 10:39 01/08/30 +0100, [EMAIL PROTECTED] wrote: Additionally, if you are thinking of XML (or HTML) then you can encode *all* Unicode characters in an EUC-encoded document, by employing numeric character references for characters outside the EUC character repertoire. Using the same technique,
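
The technique mentioned here, falling back to numeric character references for characters outside the EUC repertoire, can be sketched in Python. This is an assumption-laden illustration: the function name to_euc_jp_with_ncrs is hypothetical, and Python's "euc_jp" codec stands in for whatever converter a real toolchain would use.

```python
def to_euc_jp_with_ncrs(text: str) -> bytes:
    """Encode text as EUC-JP, using decimal NCRs for unencodable characters.

    Hypothetical helper: characters the codec cannot represent become
    references of the form &#NNNN;. Markup-significant characters such
    as '<' and '&' are not escaped here.
    """
    return text.encode("euc_jp", errors="xmlcharrefreplace")
```

For example, a Hangul syllable such as U+D55C has no EUC-JP encoding and comes out as the ASCII bytes of its numeric character reference, while the Japanese text around it is encoded normally.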

Re: japanese xml

2001-08-31 Thread Martin Duerst
Hello David, What you say is true, but it affects only a very small set of codepoints, mainly symbols. For more documentation, I recommend reading http://www.w3.org/TR/japanese-xml/. Regards, Martin. At 13:13 01/08/30 -0500, David Starner wrote: On Thu, Aug 30, 2001 at 09:51:24AM -0700,

Re: japanese xml

2001-08-29 Thread Martin Duerst
There are lots of examples out there, but mostly in legacy encodings. If you need one in an UTF, just convert it yourself (and make sure you change or remove 'encoding=euc-jp'). XML mandates that every processor (the receiving end) understands UTF-8 and UTF-16, but documents can be in other

Re: Annotation characters

2001-07-23 Thread Martin Duerst
At 01:44 01/07/21 -0400, [EMAIL PROTECTED] wrote: In a message dated 2001-07-20 6:19:24 Pacific Daylight Time, [EMAIL PROTECTED] writes: You can find a better way to do furigana, and an answer to many of your questions, at http://www.w3.org/TR/ruby (the Ruby Annotation Recommendation).

Re: Is there Unicode mail out there?

2001-07-22 Thread Martin Duerst
that a document does not allow C0 control characters, a feature that would be very important for many cases if the basic XML syntax would start to allow C0. Regards, Martin. At 10:32 01/07/19 -0600, Shigemichi Yazawa wrote: At Thu, 19 Jul 2001 15:52:39 +0900, Martin Duerst [EMAIL PROTECTED] wrote

Re: Annotation characters

2001-07-20 Thread Martin Duerst
Hello Patrick, You can find a better way to do furigana, and an answer to many of your questions, at http://www.w3.org/TR/ruby (the Ruby Annotation Recommendation). Regards, Martin. At 18:40 01/07/19 -0400, Patrick Andries wrote: Just a small question about annotation characters. If I

Re: Is there Unicode mail out there?

2001-07-18 Thread Martin Duerst
At 14:30 01/07/17 -0700, Mark Davis wrote: In that case the content of the field is not text but an octet string, and you need to do something different, like base64-ing it. The content in the database is not an octet string: it is a text field that happens to have a control code -- a

Innovative use of Latin ?!

2001-07-02 Thread Martin Duerst
For people interested in new scripts, and new uses of existing scripts :-) http://www.google.com/intl/xx-hacker/ Regards, Martin.

Re: XML Blueberry Requirements

2001-06-22 Thread Martin Duerst
Hello Elliotte, Just two points: - If you are suggesting that discussion move to xml-dev, can you please give the full address of that mailing list? - I suggest you/we don't cross-post [EMAIL PROTECTED], because it's not an issue the Unicode consortium has to decide. (I'm just

Re: UTF-8 signature in web and email

2001-05-18 Thread Martin Duerst
At 22:58 01/05/17 -0400, [EMAIL PROTECTED] wrote: Martin Dürst wrote: There is about 5% of a justification for having a 'signature' on a plain-text, standalone file (the reason being that it's somewhat easier to detect that the file is UTF-8 from the signature than to read through

Re: UTF-8 signature in web and email

2001-05-16 Thread Martin Duerst
Hello Roozbeh At 04:02 01/05/15 +0430, Roozbeh Pournader wrote: Well, I received a UTF-8 email from Microsoft's Dr International today. It was a multipart/alternative, with both the text/plain and text/html in UTF-8. Well, nothing interesting yet, but the interesting point was that the HTML

Re: Unicode in a URL

2001-04-26 Thread Martin Duerst
At 11:28 01/04/26 -0700, Markus Scherer wrote: Paul Deuter wrote: I am wondering if there isn't a need for the Unicode Spec to also dictate a way of encoding Unicode in an ASCII stream. Perhaps How many more ways do we need? To be 8-bit-friendly, we have UTF-8. To get everything into ASCII

Re: Unicode in a URL

2001-04-26 Thread Martin Duerst
Hello Paul, At 19:41 01/04/25 -0700, Paul Deuter wrote: I am struggling to figure out the correct method for encoding Unicode characters in the query string portion of a URL. There is a W3C spec that says the Unicode character should be converted to UTF-8 and then each byte should be encoded as
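
The approach described in this snippet (convert the character to UTF-8, then %-encode each byte) looks like this in Python. It is a minimal sketch using the standard urllib.parse module; the function name percent_encode_utf8 is hypothetical.

```python
from urllib.parse import quote, unquote


def percent_encode_utf8(value: str) -> str:
    """Encode value as UTF-8 and %-escape each byte that is not unreserved.

    Hypothetical helper: safe="" means reserved characters such as '/'
    are escaped too, which suits a single query-string value.
    """
    return quote(value, safe="", encoding="utf-8")


def percent_decode_utf8(value: str) -> str:
    """Reverse percent_encode_utf8, interpreting the escaped bytes as UTF-8."""
    return unquote(value, encoding="utf-8")
```

The receiving side has to know, or assume, that the escaped bytes are UTF-8; the %XX syntax itself carries no encoding label, which is the difficulty raised later in the same thread.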

RE: Unicode in a URL

2001-04-26 Thread Martin Duerst
At 15:02 01/04/26 -0700, Paul Deuter wrote: Based on the responses, I guess my original question/problem was not very well written. The %XX idea does not work because this is already in use by lots of software to encode many different character sets. So again we need something that identifies

RE: Unicode in a URL

2001-04-26 Thread Martin Duerst
Hello Mike, At 19:09 01/04/26 -0600, Mike Brown wrote: W3C specifies to use %-encoded UTF-8 for URLs. I think that's an overstatement. Neither the W3C nor the IETF make such a specification. True. Neither W3C nor IETF make such a general statement, because we can't just remove the about 10

RE: How will software source code represent 21 bit unicode characters?

2001-04-17 Thread Martin Duerst
At 09:29 01/04/17 -0500, [EMAIL PROTECTED] wrote: In a perfect world, we would probably have an enclosing symbol (e.g. '\4E00') so that the number can be variable length. tuning into another language channel In Perl the notation is \x{...}, where ... is hexdigit sequence: \x{41} is LATIN

Re: Reviewing IETF documents

2001-04-16 Thread Martin Duerst
Hello Florian - There is no official or coordinated review of IETF documents. Because of the volunteer nature of the IETF, it mostly depends on individuals. I have been in contact with the USEFOR group for a while. What particular serious problem are you speaking about? If you know about a

Re: Identifiers

2001-04-15 Thread Martin Duerst
Hello Florian, Of course, KC/KD-normalization is not sufficient. The problem already exists in ASCII. I/l/1 and 0/O can easily be confused. It will always be necessary for people to think a bit when creating their email addresses,... On the other hand, when identifiers can be written in various

Ruby Annotation and XHTML 1.1 are W3C Proposed Recommendations

2001-04-08 Thread Martin Duerst
Ruby Annotation (http://www.w3.org/TR/ruby) and XHTML(TM) 1.1 - Module-based XHTML (http://www.w3.org/TR/xhtml11) became W3C Proposed Recommendations on April 6, 2001. Abstract of 'Ruby Annotation': "Ruby" are short runs of text alongside the base text, typically used in East Asian documents