[Sab] Encodings...

Koscheev Andrey Thu, 28 Dec 2000 09:03:42 -0800

Merry Christmas and welcome back to the list

This is quite deep technical stuff, so sysadmins and Perl programmers may skip it.

A few days ago I posted a note about encoding problems in Sablotron 0.44. I left the problem unsolved because I was waiting for the new version to come out. However, nothing has changed, so I spent some time today investigating the sources and found some interesting things.

To recap the actual problem:

Assume Sablotron is compiled with libiconv on a Linux machine (I was testing 0.44, because 0.5 doesn't compile at all). After setting the XML file encoding to "windows-1250", setting the XSL file encoding to "windows-1250" and adding an <xsl:output> element with encoding "windows-1250", the following occurs: the result has an HTML META tag with the encoding set to "windows-1250", but the output itself is encoded in UTF-8. Of course, browsers trust the META tag, and the page looks very bizarre.

Sablotron checks whether the encoding is valid using libiconv functions, but it does not use encodings that are not listed in the iconv_encoding array (utf8.cpp). The function OutputDefinition::getEncoding() defaults to UTF-8, and when the encoding is not found in the static table, utf8Recode() is not called at all.
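
A minimal sketch of the behaviour described above, to make the failure mode concrete. This is not Sablotron's code: kKnownEncodings, sameName, OutputPlan and planOutput are hypothetical names standing in for the iconv_encoding table and the logic around OutputDefinition::getEncoding(), and the table entries are invented for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Hypothetical stand-in for the static table in utf8.cpp. The entries are
// invented for this sketch and are not the real contents of iconv_encoding.
static const std::array<const char*, 3> kKnownEncodings = {
    "UTF-8", "ISO-8859-1", "ISO-8859-2"};

// Case-insensitive comparison of two ASCII encoding names.
static bool sameName(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
               return std::tolower(static_cast<unsigned char>(x)) ==
                      std::tolower(static_cast<unsigned char>(y));
           });
}

struct OutputPlan {
    std::string declared;  // name written into the META tag
    std::string actual;    // encoding the output bytes really use
    bool recodeCalled;     // whether a recoding step would run
};

// Models the reported behaviour: the declared name is passed through
// unchanged, but recoding only happens for names found in the table.
OutputPlan planOutput(const std::string& requested) {
    const bool known = std::any_of(
        kKnownEncodings.begin(), kKnownEncodings.end(),
        [&](const char* name) { return sameName(name, requested); });
    const bool isUtf8 = sameName(requested, "UTF-8");
    return OutputPlan{requested, known ? requested : "UTF-8", known && !isUtf8};
}
```

With this model, planOutput("windows-1250") declares "windows-1250" while the bytes stay in UTF-8 and no recoding runs, which is the mismatch the browser ends up displaying.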

I manually added a new record to the table and changed some other functions so that it works now, but that is not the best we can do.

Conclusions:

1. Encodings in Sablotron are handled differently in different functions. This is caused by improper abstraction of the encoding functions. If all functions, constants and other encoding-related code were placed in one file and one class were used for all transformations, it would be much easier to handle errors and to add new encodings and translation libraries.

2. Libiconv's ability to translate between dozens of encodings is not used. The number of encodings is limited to those listed in the constant table.

4. I understand that such a large project has lots of other things to do and that the problem I describe is not the most important one. If I get some time next year (I mean 2001), I might try to rearrange the encoding handling and offer you a more complete suggestion of how to do it.

5. I saw many pieces of code where encodings other than UTF-8 were treated as exceptions. That means that after processing every entity, we check many times whether the encoding in use is "the one we like or the one we dislike". This decreases performance and adds very complicated nested C++ blocks to many functions. To make things behave predictably, we could separate these cases into special classes.
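
To illustrate points 1 and 5, here is a rough sketch of what "one class per encoding behind one interface" could look like. All names (OutputEncoder, Utf8Encoder, Latin1Encoder, makeEncoder) are hypothetical and do not exist in Sablotron. The hand-written ISO-8859-1 encoder is only a stand-in so that the sketch needs no external library; in a real build that slot would presumably be filled by a class wrapping libiconv's iconv_open(), iconv() and iconv_close().

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>

// Single interface for every output encoding; callers never branch on
// "is this UTF-8 or not".
class OutputEncoder {
public:
    virtual ~OutputEncoder() = default;
    virtual std::string name() const = 0;
    virtual std::string encode(const std::string& utf8) const = 0;
};

// UTF-8 is just another encoder that happens to copy its input.
class Utf8Encoder : public OutputEncoder {
public:
    std::string name() const override { return "UTF-8"; }
    std::string encode(const std::string& utf8) const override { return utf8; }
};

// Stand-in for a libiconv-backed encoder: converts UTF-8 to ISO-8859-1.
class Latin1Encoder : public OutputEncoder {
public:
    std::string name() const override { return "ISO-8859-1"; }
    std::string encode(const std::string& utf8) const override {
        std::string out;
        for (std::size_t i = 0; i < utf8.size(); ++i) {
            const unsigned char b = static_cast<unsigned char>(utf8[i]);
            if (b < 0x80) {  // plain ASCII maps to itself
                out.push_back(static_cast<char>(b));
                continue;
            }
            // Only lead bytes 0xC2 and 0xC3 decode to U+0080..U+00FF.
            if ((b == 0xC2 || b == 0xC3) && i + 1 < utf8.size()) {
                const unsigned char c = static_cast<unsigned char>(utf8[i + 1]);
                if ((c & 0xC0) == 0x80) {
                    out.push_back(static_cast<char>(((b & 0x03) << 6) | (c & 0x3F)));
                    ++i;
                    continue;
                }
            }
            throw std::runtime_error("character not representable in ISO-8859-1");
        }
        return out;
    }
};

// One place decides which encoder to use. An unsupported name is reported
// as an error instead of silently falling back to UTF-8.
std::unique_ptr<OutputEncoder> makeEncoder(const std::string& requested) {
    if (requested == "UTF-8")
        return std::unique_ptr<OutputEncoder>(new Utf8Encoder());
    if (requested == "ISO-8859-1")
        return std::unique_ptr<OutputEncoder>(new Latin1Encoder());
    throw std::invalid_argument("unsupported output encoding: " + requested);
}
```

The output code would then call makeEncoder() once and use encoder->encode() everywhere, and the META tag would be written from encoder->name(), so the declared and the actual encoding could not drift apart.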

Please don't consider this a bug report. It's just an idea for how to push things towards perfection.

Sincerely Yours

Koscheev Andrey

Eller s.r.o.
