Merry Christmas and welcome back to the list.
This is quite deep technical stuff, so sysadmins and Perl
programmers may skip it.
A few days ago I posted a note about encoding problems in
Sablotron 0.44. I left the problem unsolved because I was waiting for the new
version. However, nothing has changed, so I spent some time today
investigating the sources and found some interesting things.
Just to recap the actual problem:
Assume Sablotron is compiled with libiconv on a Linux machine (I
was testing 0.44, because 0.5 doesn't compile at all). After setting the XML file
encoding to "windows-1250", setting the XSL file encoding to "windows-1250" and
adding an <xsl:output> element with encoding "windows-1250", the following
happens: the result has an HTML META tag with the encoding set to "windows-1250"
but is actually encoded in UTF-8. Of course, browsers trust the META tag, so the
page looks very bizarre.
Sablotron checks whether the encoding is valid using libiconv
functions, but it never uses encodings that are not listed in the iconv_encoding
array (utf8.cpp). The function OutputDefinition::getEncoding() defaults to UTF-8,
and when the encoding is not found in the static table, utf8Recode() is not
called at all.
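The failure mode described above can be sketched roughly as follows. This is not Sablotron's code: the table contents, the OutputPlan struct and the planOutput() function are hypothetical names made up for illustration. The sketch only shows how a lookup in a fixed table with a silent UTF-8 fallback leaves the declared encoding and the real encoding out of sync.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Hypothetical stand-in for a fixed table of known encodings.
// The real contents of Sablotron's table are not reproduced here.
static const std::array<const char*, 3> kKnownEncodings = {
    "UTF-8", "ISO-8859-1", "ISO-8859-2"};

// Upper-case a name so that lookups are case-insensitive.
static std::string toUpper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// Hypothetical summary of what the output stage would do.
struct OutputPlan {
    std::string declared;  // encoding name written into the META tag
    std::string actual;    // encoding the output bytes really use
    bool recoded;          // whether a recoding step would run
};

// The requested name is always declared, but recoding only happens
// when the name is found in the fixed table. Otherwise the output
// silently stays in UTF-8.
OutputPlan planOutput(const std::string& requested) {
    OutputPlan plan{requested, "UTF-8", false};
    const std::string wanted = toUpper(requested);
    for (const char* known : kKnownEncodings) {
        if (wanted == known) {
            plan.actual = requested;
            plan.recoded = (wanted != "UTF-8");
            break;
        }
    }
    return plan;
}
```

With this made-up table, planOutput("windows-1250") declares "windows-1250" but keeps the bytes in UTF-8, which is the mismatch the browser then runs into.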
I have added a new record to the table manually and changed some
other functions so that it works now, but it's not the best we can
do.
Conclusions:
1. Encodings in Sablotron are handled differently in
different functions. This is caused by improper abstraction of the encoding
functions. If all functions, constants and other stuff around encoding were
placed in one file and one class were used for all transformations, it would be
much easier to handle errors and to add new encodings and translation
libraries.
2. Libiconv's ability to translate dozens of encodings is not
used. The number of encodings is limited to those listed in the constant
table.
4. I understand that such a large project has lots of
other things to do and that the problem I describe is not the most important
one. If I get some time next year (I mean 2001), I might try to rearrange the
encoding handling and offer you a more complete suggestion of how to do it.
5. I saw many pieces of code where encodings other than
UTF-8 were treated as exceptions. That means that after processing every entity,
we check many times whether the encoding used is the "one we like or the one we
dislike". It decreases performance and adds very complicated C++ block trees to
many functions. To make things behave predictably we could separate these cases
into special classes.
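To illustrate the "one class for all transformations" idea from the conclusions above, here is a minimal sketch of a wrapper around the standard iconv API. The class name Recoder and its interface are my own assumptions and do not exist in Sablotron. The sketch assumes an iconv whose input-buffer parameter is char** (as in glibc) and leaves out the final flush that stateful encodings would need.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: one object per (from, to) encoding pair.
// Any pair that iconv_open() accepts is usable, so no fixed table
// of encodings is needed.
class Recoder {
public:
    Recoder(const std::string& from, const std::string& to)
        : cd_(iconv_open(to.c_str(), from.c_str())) {
        if (cd_ == (iconv_t)-1)
            throw std::runtime_error("unsupported conversion: " +
                                     from + " -> " + to);
    }
    ~Recoder() { iconv_close(cd_); }
    Recoder(const Recoder&) = delete;
    Recoder& operator=(const Recoder&) = delete;

    // Convert a whole string; throws on invalid or incomplete input.
    std::string convert(const std::string& input) {
        // Reset the descriptor to its initial conversion state.
        iconv(cd_, nullptr, nullptr, nullptr, nullptr);
        std::string out;
        char buf[256];
        // iconv does not modify the input bytes, only the pointer.
        char* src = const_cast<char*>(input.data());
        std::size_t srcLeft = input.size();
        while (srcLeft > 0) {
            char* dst = buf;
            std::size_t dstLeft = sizeof buf;
            std::size_t rc = iconv(cd_, &src, &srcLeft, &dst, &dstLeft);
            out.append(buf, sizeof buf - dstLeft);
            // E2BIG only means the output buffer filled up: loop again.
            if (rc == (std::size_t)-1 && errno != E2BIG)
                throw std::runtime_error("conversion failed");
        }
        return out;
    }

private:
    iconv_t cd_;
};
```

A caller would construct something like Recoder r("UTF-8", "windows-1250") once and pass every output chunk through r.convert(), so the "is this the encoding we like" checks collapse into a single place, and a bad encoding name is reported at construction time instead of being silently ignored.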
Please don't consider this a bug report. It's just an
idea of how to push things toward perfection.
Sincerely Yours
Koscheev Andrey
Eller s.r.o.
