Now I understand. I thought that reading and writing with ISO-8859-1 would not change anything. I was obviously wrong.

In any case, it looks like the root problem is that XML is the wrong format for that content. One could argue that it can still be used where the response is text, but what about cases in which the response is also XML? Wrapping it in a <![CDATA[ section will probably not help, since the response may contain CDATA sections too, and the first ]]> string closes the outer CDATA section (nested CDATA sections are not allowed).

So we're left with two choices:
- Use separate files and only add references to them in the XML file.
- Use base64 encoding for all content.
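
A minimal sketch of the second option, assuming a Java version that ships java.util.Base64. The class name, the method names and the responseData element with its encoding attribute are made up for illustration and are not JMeter's actual result format. It only shows that the encoded form cannot contain ]]> or control characters, and that the original bytes come back unchanged:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of JMeter: stores raw response bytes in an XML element.
final class Base64ResponseSketch {

    // Base64 output only uses A-Z, a-z, 0-9, '+', '/' and '=',
    // so the element body needs no XML escaping and no CDATA section.
    static String toXmlElement(byte[] responseData) {
        String encoded = Base64.getEncoder().encodeToString(responseData);
        return "<responseData encoding=\"base64\">" + encoded + "</responseData>";
    }

    // Takes the text between the opening and closing tag and decodes it back to bytes.
    static byte[] fromXmlElement(String element) {
        int start = element.indexOf('>') + 1;
        int end = element.lastIndexOf("</");
        return Base64.getDecoder().decode(element.substring(start, end));
    }

    public static void main(String[] args) {
        // A response that is itself XML, contains a CDATA section and a control character.
        byte[] original = "<a><![CDATA[x]]></a>\u0001".getBytes(StandardCharsets.UTF_8);
        String xml = toXmlElement(original);
        System.out.println(xml);
        System.out.println(Arrays.equals(original, fromXmlElement(xml)));
    }
}
```

The price is that the stored responses are no longer readable in a text editor and grow by roughly a third.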

What do you (Vincent and others) think?

--
Salut,

Jordi.

Vincent Partington wrote:
Jordi Salvat i Alabart wrote:

You're indeed pointing to a whole collection of bugs... but none of them
seems to be the one affecting you :-)

[...]
- Last line trim in ResultCollector -- but only last line trim! It
should not make any difference whether you use UTF-8 or ISO-Latin-1
here, as long as you use the same one for reading and writing. Still, it
could fail if the platform encoding is one for which the UTF-8
representation of some character used in the file is not a valid
character representation. (Sorry for the very clear statement -- that's
about as good as my English can be.) ISO-Latin-1 is pretty safe, but
other platforms will of course use others...


Well, actually this _is_ the use of jorphan.io.TextFile that is causing my
problems. Reading a UTF-8 file with an ISO-8859-1 encoding and then
writing it out with that same ISO-8859-1 encoding causes a few UTF-8
sequences to be altered.

I modified jorphan.io.TextFile to keep a backup copy of the file it was
writing over and I could see a difference between the first part of the
two files in a few UTF-8 sequences.
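
A backup comparison like this can also be done in memory. The following is only a sketch with made-up names (RoundTripSketch, roundTrip, survives); it decodes a file's bytes with a charset, encodes them again, and reports whether anything changed. US-ASCII is used in the demo purely as a stand-in for a charset that cannot represent every byte it reads; it is not the charset discussed in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check, not part of jorphan: does a text-mode read + write preserve the bytes?
final class RoundTripSketch {

    // Decode with the charset, then encode with the same charset,
    // which is what a text-based "read file, write file" does.
    static byte[] roundTrip(byte[] fileBytes, Charset charset) {
        String text = new String(fileBytes, charset);
        return text.getBytes(charset);
    }

    // True if the bytes come out exactly as they went in.
    static boolean survives(byte[] fileBytes, Charset charset) {
        return Arrays.equals(fileBytes, roundTrip(fileBytes, charset));
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8:    " + survives(utf8, StandardCharsets.UTF_8));
        System.out.println("US-ASCII: " + survives(utf8, StandardCharsets.US_ASCII));
    }
}
```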


[...]
Whether TextFile should use a given encoding or just the platform
default can be discussed, but it certainly should be documented.


Either it should read the file as binary or an encoding should be passed
by its caller, so that when it is used to trim the last line of a result
file the encoding can be set to UTF-8.
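
The second variant (encoding passed by the caller) could look roughly like the sketch below. EncodedTextFileSketch and its methods are hypothetical and do not reflect the real jorphan.io.TextFile API; the point is only that the same caller-supplied charset is used for both the read and the write.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a text file helper that never falls back to the platform default.
final class EncodedTextFileSketch {

    static String read(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    static void write(Path file, String text, Charset charset) throws IOException {
        Files.write(file, text.getBytes(charset));
    }

    // Removes the last line of the file, reading and writing with the caller's charset.
    static void trimLastLine(Path file, Charset charset) throws IOException {
        String text = read(file, charset);
        // Ignore one trailing newline so that "a\nb\n" loses "b\n", not an empty line.
        int end = text.endsWith("\n") ? text.length() - 1 : text.length();
        int cut = text.lastIndexOf('\n', end - 1);
        write(file, cut < 0 ? "" : text.substring(0, cut + 1), charset);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("results", ".xml");
        write(tmp, "<sample>caf\u00e9</sample>\n</testResults>\n", StandardCharsets.UTF_8);
        trimLastLine(tmp, StandardCharsets.UTF_8);
        System.out.print(read(tmp, StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

A result collector would then call trimLastLine(file, StandardCharsets.UTF_8) when the result file is known to be UTF-8.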


Also, it's quite obvious that the ResultCollector should not handle
response data as character data, since in many cases it's binary stuff,
and any character encoding (UTF-8 or whatever) will be wrong. Actually,
XML is a bad format for binary data: we should either store that in
separate files or encode it as Base64 or the like.


I agree completely. The binary data not only contains "funny" UTF-8
characters that cause problems, it also contains XML entities such as &#1;
which our XSLT processor can't handle.
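
The &#1; case is probably not specific to one XSLT processor: XML 1.0 does not allow U+0001 at all, not even written as a character reference, so a conforming parser has to reject it. A small sketch using the JDK's built-in DOM parser (the class and method names are made up for illustration):

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical check: is the given string well-formed according to the JDK's XML parser?
final class XmlCharRefSketch {

    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors, such as an illegal character reference, end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<r>&#65;</r>")); // reference to 'A', a legal character
        System.out.println(parses("<r>&#1;</r>"));  // reference to U+0001, illegal in XML 1.0
    }
}
```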


[...]
In any case, as I said, I can't see how you can end up with a result XML
file with ISO-8859-1 content. Are you sure about that?


The content is not real ISO-8859-1 content, but corrupted UTF-8 content. I
changed jorphan.io.TextFile to read and write the file with UTF-8 encoding
and then the problem disappeared.

Regards, Vincent.





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



