Re: Java source encoding (was Re: [RTF] Jfor integration)

Peter B. West Sat, 05 Jul 2003 01:55:38 -0700

Victor Mote wrote:

Peter B. West wrote:

(I wouldn't say it was heated.)  I am curious about the impact of
someone working without any formal IDE, and just using (X)Emacs and JDEE
for development.  As far as I know, XEmacs does not support Unicode, but
if the non-ASCII characters were restricted to comments, and XEmacs
thought it was dealing with ISO-8859-15, would there be any actual
problems?  ASCII nulls aren't going to appear in such UTF-8, are they?

As long as 1) the editor doesn't think it needs to change the content when
opening or saving the file, and 2) the non-ASCII characters don't mess up
the editor's display, I don't think there is any problem. The first issue is
easy to test by opening the file, saving as something else, and diffing.
Since control characters have the same code points in ASCII and UTF-8, the
second problem should be a non-issue.
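The open, save, and diff test described above can be approximated in code. This is a minimal sketch, assuming an editor that reads each byte as one character in a single-byte charset and writes the characters back unchanged. ISO-8859-1 stands in here for ISO-8859-15 because every Java platform is required to support it; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripCheck {
    // Simulates a non-Unicode editor: decode the bytes with a single-byte
    // charset, re-encode with the same charset, and compare with the input.
    public static boolean survivesSingleByteRoundTrip(byte[] original) {
        String asSeenByEditor = new String(original, StandardCharsets.ISO_8859_1);
        byte[] saved = asSeenByEditor.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(original, saved);
    }

    public static void main(String[] args) {
        // A comment line containing one non-ASCII character, stored as UTF-8.
        byte[] utf8 = "// fran\u00e7ais".getBytes(StandardCharsets.UTF_8);
        System.out.println(survivesSingleByteRoundTrip(utf8));
    }
}
```

A real editor may still normalize line endings or trailing whitespace on save, so diffing an actual saved copy remains the more reliable test.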

I never bit the emacs bullet, and don't directly know the impact there. In
vi, and Notepad (as in most other non-Unicode editors), you'll see the two
(or more) bytes displayed in their single-byte forms, which makes sense.

I had to look it up to be sure it is true for control characters as well,
but the UTF-8 range 00-7F always represents a single-byte character, so
there should be no ASCII nulls (at least not as the result of an
ASCII-to-UTF-8 conversion).

Yes. Now that I think a bit more about it, UTF-8 guarantees that every byte of a non-ASCII character will have the 8th bit set; that's the success of the encoding (which is a delight, btw). So not only will there be no stray NULs inside a multi-byte sequence, but no stray TAB, LF, CR, etc. either.
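The high-bit property is easy to check for any given string. The sketch below counts the bytes in the UTF-8 encoding that have the 8th bit clear and compares that with the number of ASCII characters in the text; for well-formed text the two counts should match. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf8HighBit {
    // True if the only bytes below 0x80 in the UTF-8 encoding are the ones
    // that come from ASCII characters in the text (well-formed input assumed).
    public static boolean asciiBytesOnlyFromAsciiChars(String text) {
        long asciiChars = text.chars().filter(c -> c < 0x80).count();
        long asciiBytes = 0;
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) {
                asciiBytes++;
            }
        }
        return asciiChars == asciiBytes;
    }

    public static void main(String[] args) {
        // c-cedilla encodes to two bytes, the euro sign to three.
        System.out.println(asciiBytesOnlyFromAsciiChars("fran\u00e7ais\t\u20ac\n"));
    }
}
```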

The affected changes can be seen at:
http://marc.theaimsgroup.com/?l=fop-cvs&m=105647684725575&w=2
Of course, the files affected are listed there as well if you would like to
test them in your favorite editor.

It may be true that it is safer in the short-term to either 1) eliminate
such characters, or 2) encode them in the \uxxxx format, but I think it
probably makes more sense to simply say that we all need to work in a
Unicode-aware environment or at least a non-Unicode-hostile one. Either way,
this is probably something that we should document in the style guide.

As far as comments are concerned, we can say "Don't do it," because the comment is not going to be readable in any editor that is not Unicode capable, and the \u escapes make no sense in a comment. For code, just use the \u form if it is necessary.
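As an illustration of the \u form in code, here is a minimal sketch with a hypothetical class name. The source file contains only ASCII bytes, and the compiler translates the escape into the single character U+00E7.

```java
public class UnicodeEscapeDemo {
    // Pure ASCII in the source file; the string holds a c-cedilla at index 4.
    public static final String FRENCH = "fran\u00e7ais";

    public static void main(String[] args) {
        System.out.println(FRENCH.length());               // prints 8
        System.out.println((int) FRENCH.charAt(4) == 0xE7); // prints true
    }
}
```

Because the file itself stays ASCII, it opens and saves identically in any editor, whatever encoding the editor assumes.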

There are examples in alt.design's org/apache/fop/datatypes/CountryLanguageScript.java, generated(!) from xml-lang.xsl and xml-lang.xml, currently in the conf directory. The language codes from ISO 639-2T, ISO 639-2B and ISO 639-1 include the French name. http://www.loc.gov/standards/iso639-2/langhome.html presents ISO 639-2 in four tables, sorted by English name, French name, bibliographic code and terminology code respectively. I included the French names in the XML, and in the generated code, although I have not done the same for script or country codes. The easiest way out is probably to remove the French names, but I am loath to do that.

Peter
--
Peter B. West  http://www.powerup.com.au/~pbwest/resume.html

