Andrey,
encoding problems are a well-known issue (although the bug you reported
is quite new to us).
We've collected many patches and notes related to the encoding problems.
A better solution was scheduled for the current release, but we had to
cancel it at the last minute.
Encoding is now at the top of our TODO list.
Tom will give you a more exact answer to all your questions and
proposals; please be so kind as to wait a few days.
Regards
Pavel
Koscheev Andrey wrote:
>
>
> Merry Christmas
>
>
>
> This is quite deep technical stuff, so sysadmins and Perl programmers
> may skip it.
>
>
>
>
>
> A few days ago I posted a note about encoding problems in Sablotron
> 0.44. The problem was not solved since I was waiting for the new version
> to come. However nothing has changed, so I have spent some time today
> investigating the sources and revealed some interesting things.
>
>
>
> Just to remind the real problem:
>
> Assume Sablotron compiled with libiconv on a Linux machine (I was
> testing 0.44, because 0.5 doesn't compile at all). After setting XML
> file encoding to "windows-1250", setting XSL file encoding to
> "windows-1250" and adding an <xsl:output> element with encoding
> "windows-1250", the following happens: the result has an HTML meta tag
> with encoding set to "windows-1250" but is actually encoded in UTF-8.
> Of course, browsers trust the META tag and the page looks very bizarre.
>
>
>
> Sablotron checks whether the encoding is valid using libiconv functions,
> but it doesn't use encodings which are not listed in the iconv_encoding
> array (utf8.cpp). Function OutputDefinition::getEncoding() defaults to
> UTF-8, and when the encoding is not found in the static table,
> utf8Recode() is not called at all.
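
The failure mode described above can be sketched in isolation. This is only an illustration, not Sablotron's code: the table contents and the names kKnownEncodings, findInTable, effectiveEncodingSilent and effectiveEncodingChecked are all hypothetical. It shows how a lookup that silently defaults to UTF-8 lets the declared and the actual output encoding diverge, and how returning an explicit flag would make the mismatch visible to the caller.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical stand-in for a fixed table of known encodings.
// The entries are illustrative, not Sablotron's actual list.
constexpr std::array<std::string_view, 3> kKnownEncodings = {
    "UTF-8", "ISO-8859-1", "ISO-8859-2"};

// Returns the table entry matching `name` (exact, case-sensitive match),
// or nothing if the name is not listed.
std::optional<std::string_view> findInTable(std::string_view name) {
    for (std::string_view known : kKnownEncodings) {
        if (known == name) return known;
    }
    return std::nullopt;
}

// Silent fallback: an unlisted encoding quietly becomes UTF-8, so the
// declared encoding and the real output encoding can differ.
std::string_view effectiveEncodingSilent(std::string_view requested) {
    return findInTable(requested).value_or("UTF-8");
}

// Explicit alternative: the caller learns whether the request was honoured.
struct EncodingChoice {
    std::string_view encoding;  // encoding the output bytes will really use
    bool honoured;              // false if we fell back to UTF-8
};

EncodingChoice effectiveEncodingChecked(std::string_view requested) {
    if (auto hit = findInTable(requested)) return {*hit, true};
    return {"UTF-8", false};
}

// Usage: "windows-1250" is not in the table above.
void demo() {
    assert(effectiveEncodingSilent("windows-1250") == "UTF-8");
    assert(!effectiveEncodingChecked("windows-1250").honoured);
    assert(effectiveEncodingChecked("ISO-8859-2").honoured);
}
```

With the checked variant, the serializer could either refuse the request or write the fallback name into the META tag, instead of emitting a tag that does not match the bytes.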
>
>
>
> I have added a new record to the table manually and changed some other
> functions so that it works now, but it's not the best thing we can do.
>
>
>
> Conclusions:
>
> 1. Encodings in Sablotron are handled differently in different
> functions. This is caused by improper abstraction of encoding functions.
> If all functions, constants and other stuff around encoding were placed
> in one file and one class were used for all transformations, it would be
> much easier to handle errors and add new encodings and translation
> libraries.
>
> 2. Libiconv's ability to translate dozens of encodings is not used. The
> number of encodings is limited to those specified in the constant table.
>
> 4. I understand that such a large project requires lots of other things
> to do and the problem I describe is not the most important. If I get
> some time next year (I mean 2001), I might try to rearrange encoding
> handling and offer you a more complete suggestion how to do it.
>
> 5. I saw many pieces of code where encodings different from UTF-8 were
> treated as exceptions. That means that after processing every entity, we
> check many times if the encoding used is the "one we like or the one we
> dislike". It decreases performance and adds very complicated C++ block
> trees to many functions. To make things function predictably we could
> separate these into special classes.
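
As a rough illustration of points 1 and 2, here is a minimal sketch of one class that owns every iconv call and accepts any encoding name the local iconv knows, with no private table in front of it. The class name Recoder and its interface are hypothetical; only iconv_open(), iconv() and iconv_close() are the real POSIX calls, used here with the glibc-style `char**` input signature (some older libiconv builds expect `const char**`).

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: the single place that talks to iconv.
class Recoder {
public:
    // Any pair of names the local iconv supports is accepted.
    Recoder(const std::string& to, const std::string& from)
        : cd_(iconv_open(to.c_str(), from.c_str())) {
        if (cd_ == (iconv_t)(-1)) {
            throw std::runtime_error("unsupported conversion: " + from +
                                     " -> " + to);
        }
    }
    ~Recoder() { iconv_close(cd_); }
    Recoder(const Recoder&) = delete;
    Recoder& operator=(const Recoder&) = delete;

    // Converts a whole buffer; throws on invalid or unconvertible input.
    std::string convert(const std::string& input) {
        assert(cd_ != (iconv_t)(-1));
        iconv(cd_, nullptr, nullptr, nullptr, nullptr);  // reset state
        std::string in = input;  // iconv needs a mutable char*
        char* src = &in[0];
        std::size_t srcLeft = in.size();
        std::string out;
        char chunk[256];
        while (srcLeft > 0) {
            char* dst = chunk;
            std::size_t dstLeft = sizeof chunk;
            std::size_t rc = iconv(cd_, &src, &srcLeft, &dst, &dstLeft);
            int err = errno;  // save before any other call can change it
            out.append(chunk, sizeof chunk - dstLeft);
            // E2BIG only means the chunk is full; anything else is fatal.
            if (rc == (std::size_t)(-1) && err != E2BIG) {
                throw std::runtime_error("conversion failed");
            }
        }
        // Flush any pending shift sequence of stateful encodings.
        char* dst = chunk;
        std::size_t dstLeft = sizeof chunk;
        iconv(cd_, nullptr, nullptr, &dst, &dstLeft);
        out.append(chunk, sizeof chunk - dstLeft);
        return out;
    }

private:
    iconv_t cd_;
};
```

A caller would write something like `Recoder("windows-1250", "UTF-8").convert(text)`; whether that particular name is available depends on the iconv installation. An unsupported name surfaces as an exception at construction time, so the output definition can react before any META tag is written.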
>
>
>
>
>
> Please don't consider it to be a bug description. It's just an idea how
> to force things to perfection.
>
>
>
> Sincerely Yours
>
>
>
> Koscheev Andrey
>
> Eller s.r.o.
--
Pavel Hlavnicka
Ginger Alliance Ltd.
Prague; Czech Republic