Jordi Salvat i Alabart wrote:
> You're indeed pointing to a whole collection of bugs... but none of them
>  seems to be the one affecting you :-)
>
> [...]
> - Last line trim in ResultCollector -- but only last line trim! It
> should not make any difference whether you use UTF-8 or ISO-Latin-1
> here, as long as you use the same one for reading and writing. Still, it
> could fail if the platform encoding is one for which the UTF-8
> representation of some character used in the file is not a valid
> character representation. (Sorry for the very clear statement -- that's
> about as good as my English can be.) ISO-Latin-1 is pretty safe, but
> other platforms will of course use others...

Well, actually it _is_ the use of jorphan.io.TextFile that is causing my
problems. Reading a UTF-8 file with an ISO-8859-1 encoding and then
writing it out with that same ISO-8859-1 encoding causes a few UTF-8
sequences to be altered.

I modified jorphan.io.TextFile to keep a backup copy of the file it was
overwriting, and comparing the two files showed that a few UTF-8
sequences in the first part of the file had changed.
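
The kind of damage described above can be reproduced in miniature. The
sketch below is only an illustration, not JMeter code: the class and
method names are made up, and US-ASCII is used purely as a stand-in for
"a charset that has no mapping for some of the bytes in the file". It
does not claim to be the charset that was actually in effect here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of JMeter or jorphan.
final class RoundTripDemo {

    /** Decodes the bytes with the given charset and encodes them again,
     *  which is what a text-based "read, edit, write back" cycle does. */
    static byte[] roundTrip(byte[] original, Charset charset) {
        return new String(original, charset).getBytes(charset);
    }

    /** True if the bytes come back unchanged after the round trip. */
    static boolean survives(byte[] original, Charset charset) {
        return Arrays.equals(original, roundTrip(original, charset));
    }

    public static void main(String[] args) {
        // "café" in UTF-8: the last character takes two bytes (C3 A9).
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println("UTF-8 round trip intact:    "
                + survives(utf8, StandardCharsets.UTF_8));
        System.out.println("US-ASCII round trip intact: "
                + survives(utf8, StandardCharsets.US_ASCII));
    }
}
```

With a charset that cannot decode the two bytes of the accented
character, the decoder substitutes a replacement character for each of
them and the encoder then writes a '?' in its place, so the file that
gets written back no longer matches the one that was read.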

> [...]
> Whether TextFile should use a given encoding or just the platform
> default can be discussed, but it certainly should be documented.

Either it should read the file as binary, or its caller should pass in
an encoding, so that when it is used to trim the last line of a result
file the encoding can be set to UTF-8.
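
As a rough sketch of the second option, a last-line trim that takes the
charset from its caller could look like the following. This is not the
existing TextFile or ResultCollector code: LastLineTrimmer and both
trimLastLine methods are hypothetical names, and "trim" is assumed here
to mean "drop everything after the last line break that precedes the
final line".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of JMeter or jorphan.
final class LastLineTrimmer {

    /** Returns the text without its final line. Earlier lines keep
     *  their line terminators; a single-line text becomes empty. */
    static String trimLastLine(String text) {
        int end = text.length();
        // Skip one trailing "\n" or "\r\n" so the last real line is cut.
        if (end > 0 && text.charAt(end - 1) == '\n') end--;
        if (end > 0 && text.charAt(end - 1) == '\r') end--;
        int cut = text.lastIndexOf('\n', end - 1);
        return cut < 0 ? "" : text.substring(0, cut + 1);
    }

    /** Rewrites the file without its final line, decoding and encoding
     *  with the charset chosen by the caller, never the platform default. */
    static void trimLastLine(Path file, Charset charset) {
        try {
            String text = new String(Files.readAllBytes(file), charset);
            Files.write(file, trimLastLine(text).getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller that knows the result file is UTF-8 would then call
LastLineTrimmer.trimLastLine(path, StandardCharsets.UTF_8), and the
bytes of the lines that are kept would be written back unchanged.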

> Also, it's quite obvious that the ResultCollector should not handle
> response data as character data, since in many cases it's binary stuff,
> and any character encoding (UTF-8 or whatever) will be wrong. Actually,
> XML is a bad format for binary data: we should either store that in
> separate files or encode it base-64 or alike.

I agree completely. The binary data not only contains "funny" UTF-8
characters that cause problems, it also contains XML entities such as 
which our XSLT processor can't handle.
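
For the base-64 option, a minimal sketch might look like this. It
assumes Java 8 or later for java.util.Base64; the class name, the
"binary" element and its "encoding" attribute are invented for
illustration and are not part of the JMeter result format.

```java
import java.util.Base64;

// Hypothetical helper; element and attribute names are made up.
final class BinaryInXml {

    /** Wraps raw response bytes in an element whose text is base-64,
     *  so no byte is ever interpreted as a character or an entity. */
    static String toElement(byte[] responseData) {
        String encoded = Base64.getEncoder().encodeToString(responseData);
        return "<binary encoding=\"base64\">" + encoded + "</binary>";
    }

    /** Recovers the original bytes from the base-64 text content. */
    static byte[] fromBase64(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

The base-64 alphabet consists only of letters, digits, '+', '/' and '=',
so the encoded text needs no XML escaping and reads the same in any
ASCII-compatible encoding.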

> [...]
> In any case, as I said, I can't see how you can end up with a result XML
> file with ISO-8859-1 content. Are you sure about that?

The content is not real ISO-8859-1 content, but corrupted UTF-8 content. I
changed jorphan.io.TextFile to read and write the file with UTF-8 encoding
and then the problem disappeared.

Regards, Vincent.




