Ignacio Renuncio <[EMAIL PROTECTED]> wrote:
>
> Hi again,
>
> I've found a little problem with the XML importer related to the charset
> used:
>
> I took a sample XML and imported it ok, but when I tried to do it with a
> real one, some characters displayed wrong when viewing the new records with
> the "my_editors" editor.
>
> After examining the XML I could view that the offending characters were
> related to the encoding. I changed the encoding to ISO-8859-1 and tried to
> save the XML, but XMLSPY complained about it telling me:
>
> "Your document contains 13 character(s) that cannot be represented in the
> ISO 8859-1 (Latin-1/West European) character-set encoding. (...blah
> blah...)"
>
> BTW, the offending characters are 0x2026 (three dots character) and 0x2013
> (typographical dash), they seem to have been "auto-formatted" MS Word when
> typing the texts.

I suppose XMLImporter uses a decent XML parser. XML is encoded in UTF-8 by
default, so you should have written your data as UTF-8. If you change the
declared encoding, you also have to re-encode the actual characters to match.
I suppose that is impossible for the two characters you mention, because they
are not part of the ISO-8859-1 character set. You should use some kind of
Word filter, or stick to UTF-8.

But anyhow, it should not matter how the source XML is encoded, as long as it
is encoded correctly.

> 1st question: Which encoding does MMBase use to display data?

That is not determined. Internally MMBase is Java, which does not specify an
encoding for Strings. Strings can contain any character from the Unicode
character set. The 'basic jsp' editors use UTF-8 for displaying, and I suppose
the my_editors do that as well.

> 2nd question: Is this a fault in the XML Importer module?

That is possible (which would mean that it does not use a decent XML parser,
as I suggested earlier). The other possibility is that there is an error in
your XML's encoding.

Michiel

-- 
Michiel Meeuwissen
Mediacentrum 140 H'sum
+31 (0)35 6772979
nl_NL eo_XX en_US
mihxil' [] ()
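
P.S. To illustrate the point about 0x2026 and 0x2013 above, here is a minimal Java sketch. The class name EncodingCheck and the methods canEncode and asciiFilter are made up for this example; they are not part of MMBase or of the XML importer. The sketch only shows that the two characters cannot be represented in ISO-8859-1 but can in UTF-8, and what a very crude "Word filter" might look like.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingCheck {

    /** Returns true if every character of text can be represented in the named charset. */
    static boolean canEncode(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    /**
     * Hypothetical, very crude "Word filter": replaces only the two characters
     * mentioned in this thread (U+2026 ellipsis, U+2013 en dash) by ASCII text.
     */
    static String asciiFilter(String text) {
        return text.replace("\u2026", "...").replace("\u2013", "-");
    }

    public static void main(String[] args) {
        // Sample text containing both offending characters.
        String sample = "Wait\u2026 see pages 3\u20135";

        System.out.println("ISO-8859-1 ok? " + canEncode(sample, "ISO-8859-1"));
        System.out.println("UTF-8 ok?      " + canEncode(sample, "UTF-8"));

        // After filtering, the text fits into ISO-8859-1 as well.
        String filtered = asciiFilter(sample);
        System.out.println("Filtered:      " + filtered);
        System.out.println("ISO-8859-1 ok? " + canEncode(filtered, "ISO-8859-1"));
    }
}
```

If the check fails for the encoding you declare in the XML header, either filter the text first or keep the file in UTF-8.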
