Peter B. West wrote:
(I wouldn't say it was heated.) I am curious about the impact on someone working without any formal IDE, and just using (X)Emacs and JDEE for development. As far as I know, XEmacs does not support Unicode, but if the non-ASCII characters were restricted to comments, and XEmacs thought it was dealing with ISO-8859-15, would there be any actual problems? ASCII NULs aren't going to appear in such UTF-8, are they?
As long as 1) the editor doesn't think it needs to change the content when opening or saving the file, and 2) the non-ASCII characters don't mess up the editor's display, I don't think there is any problem. The first issue is easy to test by opening the file, saving as something else, and diffing. Since control characters have the same code points in ASCII and UTF-8, the second problem should be a non-issue.
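The open, save-as, and diff test can also be scripted. The following is only a sketch: the class and method names (RoundTripCheck, sameBytes) are made up for illustration and are not part of FOP. It compares the original file with the copy the editor wrote, byte for byte.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of FOP: checks whether an editor's
// "save as" copy is byte-for-byte identical to the original file.
class RoundTripCheck {
    static boolean sameBytes(Path original, Path resaved) {
        try {
            // Compare raw bytes so that no charset decoding can hide a change.
            return Arrays.equals(Files.readAllBytes(original),
                                 Files.readAllBytes(resaved));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If sameBytes returns false for a file that was only opened and saved again, the editor has rewritten the content, which is the first problem described above.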
I never bit the emacs bullet, and don't directly know the impact there. In vi, and Notepad (as in most other non-Unicode editors), you'll see the two (or more) bytes displayed in their single-byte forms, which makes sense.
I had to look it up to be sure it is true for control characters as well, but the UTF-8 range 00-7F always represents a single-byte character, so there should be no ASCII nulls (at least not as the result of an ASCII-to-UTF-8 conversion).
Yes. Now that I think a bit more about it, UTF-8 guarantees that every byte of a non-ASCII character will have the 8th bit set; that's the success of the encoding (which is a delight, btw). So not only will there be no NULs, but no TAB, LF, CR, etc., hidden inside a multi-byte sequence.
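For anyone who wants to convince themselves of the high-bit property, here is a small sketch. The class and method names are invented for this example, and it assumes well-formed strings (no unpaired surrogates). It encodes each non-ASCII character to UTF-8 and checks every resulting byte.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the names here are hypothetical, not from FOP.
class Utf8HighBitCheck {
    // True if every UTF-8 byte of every non-ASCII character has bit 8 set.
    static boolean nonAsciiBytesHaveHighBit(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (codePoint < 0x80) {
                continue; // ASCII: encoded as the same single byte
            }
            byte[] encoded = new String(Character.toChars(codePoint))
                    .getBytes(StandardCharsets.UTF_8);
            for (byte b : encoded) {
                if ((b & 0x80) == 0) {
                    return false;
                }
            }
        }
        return true;
    }

    // Counts the UTF-8 bytes that fall in the ASCII range 00-7F.
    static int asciiRangeByteCount(String text) {
        int count = 0;
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) {
                count++;
            }
        }
        return count;
    }
}
```

For a string with seven ASCII letters and one c-cedilla, asciiRangeByteCount should report 7: the two bytes of the c-cedilla both have the high bit set, so neither can be mistaken for a control character.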
The affected changes can be seen at: http://marc.theaimsgroup.com/?l=fop-cvs&m=105647684725575&w=2 Of course, the files affected are listed there as well if you would like to test them in your favorite editor.
It may be true that it is safer in the short-term to either 1) eliminate such characters, or 2) encode them in the \uxxxx format, but I think it probably makes more sense to simply say that we all need to work in a Unicode-aware environment or at least a non-Unicode-hostile one. Either way, this is probably something that we should document in the style guide.
As far as comments are concerned, we can say "Don't do it," because the comment is not going to be readable in any editor that is not Unicode capable, and \u escapes make no sense in a comment. For code, just use the \u form if it is necessary.
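As a sketch of what the escaped form looks like in code (the class and constant below are hypothetical, not taken from CountryLanguageScript.java), a French name can be written so that the source file stays pure ASCII:

```java
// Hypothetical example, not from FOP: a non-ASCII string constant
// written with a Unicode escape so the file contains only ASCII bytes.
class UnicodeEscapeExample {
    // The fifth character is U+00E7, LATIN SMALL LETTER C WITH CEDILLA.
    static final String FRENCH_NAME = "fran\u00e7ais";
}
```

The compiler translates the escape before anything else, so the constant holds the same eight characters as if the c-cedilla had been typed directly, and the file survives any ASCII-compatible editor unchanged.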
There are examples in alt.design's org/apache/fop/datatypes/CountryLanguageScript.java, generated(!) from xml-lang.xsl and xml-lang.xml, currently in the conf directory. The language codes from ISO 639-2T, ISO 639-2B and ISO 639-1 include the French name. http://www.loc.gov/standards/iso639-2/langhome.html represents ISO 639-2 in four tables, sorted by English name, French name, bibliographic code and terminology code respectively. I included the French names in the XML, and in the generated code, although I have not done the same for script or country codes. The easiest way out is probably to remove the French names, but I am loath to do that.
Peter -- Peter B. West http://www.powerup.com.au/~pbwest/resume.html