Hi,

Andreas Delmelle wrote:
> On Sep 3, 2008, at 18:35, Steffanina, Jeff wrote:
> 
> Hi Jeff
> 
>> There is always one MORE option to consider!!
>>
>> What would you suggest as the best way to handle this?
> 
> I think I'd opt for using (N)umeric (C)haracter (R)eferences. Reasoning
> would be that if one changes the BASIC code to emit the sequence
> '&#232;', this will never, ever have to be changed (unless Unicode would
> somehow decide on altering the codepoints). You can change the encoding
> in the XML header all you want, NCRs will always work.
> 
> On the other hand, if you have a LOT of those characters, using NCRs
> could make your XML a bit bulky (instead of 1 byte/character, you

Not to mention that this would make the document really tedious to
type, and not very readable...


> actually generate 6-8 bytes to represent one character in the final
> result; the XML parser, instead of needing only one byte, has to parse
> all bytes from '&' up to and including ';').
> The character code you mentioned earlier (130) is the decimal value for
> 'é' in ASCII, so if you're concerned with the size of the XML and do not
> want to generate 6 bytes for one character, try specifying "US-ASCII" as
> encoding for the source XML.

No, US-ASCII is a 7-bit character set, which means it can contain only
128 characters, none of them being an accented letter [1].
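
To make that concrete, here is a small Python sketch (Python is only my
choice for illustration, nothing in your setup depends on it). It assumes
that the 130 coming out of your BASIC program is a byte from a DOS code
page such as CP437 or CP850, which is a guess on my part. In those, byte
130 is indeed 'é', while US-ASCII has no such character at all:

```python
# Illustration only: how 'é' fares in a few character sets.
e_acute = "\u00e9"  # 'é'

# US-ASCII has no 'é': encoding fails outright.
try:
    e_acute.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Assumption: the value 130 comes from a DOS code page (CP437 or CP850).
from_cp437 = bytes([130]).decode("cp437")
from_cp850 = bytes([130]).decode("cp850")

# The same character is one byte in ISO-8859-15 and two bytes in UTF-8.
latin9_bytes = e_acute.encode("iso-8859-15")
utf8_bytes = e_acute.encode("utf-8")

# In an ASCII-only file, the only way to write 'é' is a character reference.
ncr = e_acute.encode("ascii", "xmlcharrefreplace")

print(ascii_ok, ascii(from_cp437), ascii(from_cp850))
print(latin9_bytes, utf8_bytes, ncr)
```

So declaring US-ASCII does not save any bytes: with that encoding the
accented letters can only be written as character references.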

From your other message it looks like the default character set on your
system is ISO-8859-15, which is ok for all of the western languages plus
a few more [2]. Your BASIC program probably uses that character set, in
which case you just have to change the header of your xml file:
    <?xml version="1.0" encoding="ISO-8859-15"?>
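
If you want to check that the header does its job, a quick test along
these lines may help. It is only a sketch: the <word> element and its
content are made up, and I use Python's standard ElementTree parser
rather than whatever parser your toolchain actually runs:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: 'é' is stored as the single byte 0xE9, which is
# how ISO-8859-15 encodes it.
raw = b'<?xml version="1.0" encoding="ISO-8859-15"?>\n<word>caf\xe9</word>'

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(raw)
text = root.text

# Without a declaration the parser assumes UTF-8, and the lone 0xE9 byte
# is rejected as malformed input.
undeclared = b'<word>caf\xe9</word>'
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(ascii(text), undeclared_ok)
```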

As long as you put the right header in the XML file you can live with
that setup. However, it is safer to switch to UTF-8 now, to avoid
trouble in the future. Indeed, it’s likely that when you change your
computer or upgrade your system the default character set will become
UTF-8. Then if you re-edit that file on the new system, accented
letters will be entered as UTF-8 sequences that are incompatible with
ISO-8859-15, and you’ll basically see garbage in the result, unless your
editor is smart enough to recognize that the file is XML and parse the
header to get its encoding. But I doubt many editors do that...
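
The kind of garbage I mean can be reproduced in a few lines of Python
(again just an illustration, with variable names of my own):

```python
# 'é' typed on a UTF-8 system is stored as two bytes.
typed_on_new_system = "\u00e9".encode("utf-8")

# A reader that still trusts the ISO-8859-15 header sees two characters.
garbage = typed_on_new_system.decode("iso-8859-15")

# The opposite mix-up: old ISO-8859-15 bytes read as UTF-8 are rejected.
old_bytes = "\u00e9".encode("iso-8859-15")
try:
    old_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(ascii(garbage), decodes_as_utf8)
```

The first case is the nasty one because nothing fails: the file is still
valid ISO-8859-15, it just no longer says what you typed.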

You can choose to convert your files to UTF-8 later on, but that might
represent a lot of work, plus you will have to edit every file to change
the XML header to UTF-8. Since UTF-8 will become the default charset
sooner or later, you had better do that now, while you don’t have too
many files.
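
If you do end up converting files in bulk, the job can be scripted. The
following is a rough sketch rather than a finished tool: to_utf8 and
_DECL_ENCODING are names I made up, it assumes the input really is
ISO-8859-15, and it only rewrites an XML declaration found at the very
start of the data:

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    rb'^(<\?xml[^>]*?encoding=)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def to_utf8(data: bytes, source_encoding: str = "iso-8859-15") -> bytes:
    """Re-encode XML bytes as UTF-8 and update the declaration to match."""
    text = data.decode(source_encoding)
    converted = text.encode("utf-8")
    # Only a declaration at the very start of the data is touched.
    return _DECL_ENCODING.sub(rb'\g<1>\g<2>UTF-8\g<2>', converted, count=1)


# Made-up sample document, with 'é' stored as the ISO-8859-15 byte 0xE9.
sample = b'<?xml version="1.0" encoding="ISO-8859-15"?>\n<word>caf\xe9</word>'
print(to_utf8(sample))
```

Run it on copies first. A file that was already saved as UTF-8 would be
converted twice and come out damaged.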

Changing the default character set is very system-dependent. Basically
you have to play with the locale environment variables (LANG, LC_ALL,
LC_CTYPE). You may be able to get a list of available locales by typing
the following command in a terminal:
    $ locale -a
    C
    en_US.iso885915
    en_US.utf8
    ...

If no UTF-8 locale is available it must be generated. Try to find
documentation for your system or ask the system administrator if
applicable...
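
If the list is long, a few lines of Python can pick out the UTF-8
entries. This is only a sketch: utf8_locales is a helper name I made up,
and the spelling of locale names (utf8, UTF-8, ...) varies between
systems, so the filter is deliberately loose:

```python
import locale
from typing import List


def utf8_locales(listing: str) -> List[str]:
    """Return the UTF-8 entries from the text printed by `locale -a`."""
    names = [line.strip() for line in listing.splitlines()]
    return [n for n in names
            if n.lower().replace("-", "").endswith("utf8")]


# Sample listing copied from the terminal session above.
sample = "C\nen_US.iso885915\nen_US.utf8\n"
print(utf8_locales(sample))

# What Python believes the current default encoding is (system-dependent).
print(locale.getpreferredencoding(False))
```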

You find that complicated? It is, it has always been, and I’m afraid it
may forever be. This is historical...

[1] http://en.wikipedia.org/wiki/Ascii
[2] http://en.wikipedia.org/wiki/ISO/IEC_8859-15

HTH,
Vincent
