Hi,

Andreas Delmelle wrote:
> On Sep 3, 2008, at 18:35, Steffanina, Jeff wrote:
>
> Hi Jeff
>
>> There is always one MORE option to consider!!
>>
>> What would you suggest as the best way to handle this?
>
> I think I'd opt for using (N)umeric (C)haracter (R)eferences. Reasoning
> would be that if one changes the BASIC code to emit the sequence
> '&#232;', this will never, ever have to be changed (unless Unicode would
> somehow decide on altering the codepoints). You can change the encoding
> in the XML header all you want, NCRs will always work.
>
> On the other hand, if you have a LOT of those characters, using NCRs
> could make your XML a bit bulky (instead of 1 byte/character, you
Not to mention the fact that this would make the document really tedious
to type, and not very readable...

> actually generate 6-8 bytes to represent one character in the final
> result; the XML parser, instead of needing only one byte, has to parse
> all bytes from '&' up to and including ';').
>
> The character code you mentioned earlier (130) is the decimal value for
> 'é' in ASCII, so if you're concerned with the size of the XML and do not
> want to generate 6 bytes for one character, try specifying "US-ASCII" as
> encoding for the source XML.

No, US-ASCII is a 7-bit character set, which means it can contain only
128 characters, none of them being an accented letter [1].

From your other message it looks like the default character set on your
system is ISO-8859-15, which is OK for all of the Western languages plus
a few more [2]. Your BASIC program probably uses that character set, in
which case you just have to change the header of your XML file:

<?xml version="1.0" encoding="ISO-8859-15"?>

As long as you put the right header in the XML file, you can live with
that setup. However, it is safer to switch to UTF-8 now, in order to
avoid trouble in the future. Indeed, it's likely that when you change
your computer or upgrade your system, the default character set will
become UTF-8. Then, if you re-edit that file on the new system, accented
letters will be entered as UTF-8 byte sequences that are incompatible
with ISO-8859-15, and you'll basically see garbage in the result. Unless
your editor is sophisticated enough to recognize that the file is XML
and to parse the header to get its encoding, but I doubt many editors do
that...

You can choose to convert your files to UTF-8 later on, but that might
represent a lot of work, plus you will have to edit every file to change
the XML header to UTF-8. Since the switch to UTF-8 as the default
charset will happen sooner or later, you'd better do it now, while you
don't have too many files.
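To make the byte-count discussion above concrete, here is a small
illustration (a sketch in plain Python, which is not part of the
original setup being discussed) of how the same accented letter is
represented in each of the three schemes mentioned, and of the "garbage"
you get when UTF-8 bytes are misread as ISO-8859-15:

```python
# One accented letter, three representations.
text = "é"

latin9 = text.encode("iso-8859-15")  # 1 byte:  b'\xe9'
utf8 = text.encode("utf-8")          # 2 bytes: b'\xc3\xa9'
ncr = "&#%d;" % ord(text)            # 6 ASCII bytes: "&#233;"

print(len(latin9), len(utf8), len(ncr))  # 1 2 6

# The mojibake effect described above: UTF-8 bytes decoded
# with the wrong (ISO-8859-15) charset come out as two junk
# characters instead of one accented letter.
print(utf8.decode("iso-8859-15"))  # Ã©
```

For the later bulk conversion, the usual tool on Unix-like systems is
iconv, e.g. `iconv -f ISO-8859-15 -t UTF-8 old.xml > new.xml` (you still
have to fix the encoding declaration in the header yourself).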
Changing the default character set is very system-dependent. Basically
you have to play with the locale environment variables. You may get a
list of available locales by typing the following command in a terminal:

$ locale -a
C
en_US.iso885915
en_US.utf8
...

If no UTF-8 locale is available, it must be generated. Try to find
documentation for your system, or ask the system administrator if
applicable...

You find that complicated? It is, it has always been, and I'm afraid it
may forever be. This is historical...

[1] http://en.wikipedia.org/wiki/Ascii
[2] http://en.wikipedia.org/wiki/ISO/IEC_8859-15

HTH,
Vincent

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]