Hi Andrey,
this is perfectly right, I agree with all the points you make. As I have
said several times on this list, the encoding functions will be
completely rewritten; their present state is just an initial attempt to
deal with the problem. We have 5 or 6 contributions adding support for
different encodings, mostly by different means. We'd like to find a
uniform, flexible solution to the problem, not another makeshift fix.
This is rather complex and the whole thing drags somewhat. But we'd be
happy to get this done ASAP.
One obvious change is that the encoding enum (constant table) is best
dropped altogether. We'll use the strings understood by iconv to identify
the encodings.
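
To illustrate what identifying encodings by iconv names could look like,
here is a minimal sketch. It is not Sablotron code: the function name
recodeFromUtf8 and its signature are assumptions made up for this example,
and only the standard iconv_open/iconv/iconv_close calls are relied upon.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Sablotron): recode UTF-8 text into the
// encoding named by `target`, where `target` is any name iconv understands.
// Returns false if iconv does not know the name or the input cannot be
// converted; on success `out` holds the recoded bytes.
bool recodeFromUtf8(const std::string& utf8, const std::string& target,
                    std::string& out) {
    iconv_t cd = iconv_open(target.c_str(), "UTF-8");
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        return false;  // encoding name unknown to iconv
    }

    out.clear();
    char buf[256];
    // iconv takes a non-const char**; it does not modify the input bytes.
    char* in = const_cast<char*>(utf8.data());
    std::size_t inLeft = utf8.size();
    bool ok = true;

    while (inLeft > 0) {
        char* dst = buf;
        std::size_t outLeft = sizeof buf;
        std::size_t rc = iconv(cd, &in, &inLeft, &dst, &outLeft);
        int err = errno;  // save before any other call can change it
        out.append(buf, sizeof buf - outLeft);
        // E2BIG only means the chunk buffer is full, so keep looping.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            ok = false;  // invalid or unconvertible input
            break;
        }
    }

    if (ok) {
        // Flush any pending shift sequence for stateful encodings.
        char* dst = buf;
        std::size_t outLeft = sizeof buf;
        if (iconv(cd, nullptr, nullptr, &dst, &outLeft) !=
            static_cast<std::size_t>(-1)) {
            out.append(buf, sizeof buf - outLeft);
        }
    }

    iconv_close(cd);
    return ok;
}
```

With something along these lines, a call such as
recodeFromUtf8(text, "windows-1250", result) would need no entry in a
constant table; whether a given name is accepted depends only on the iconv
implementation installed on the machine.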
We'll appreciate any further suggestions or comments on encodings.
Tom
Koscheev Andrey wrote:
> Merry Christmas and welcome back to the list
>
>
>
> This is quite deep technical stuff, so sysadmins and Perl programmers
> may skip it.
>
>
>
>
>
> A few days ago I posted a note about encoding problems in Sablotron
> 0.44. The problem was not solved since I was waiting for the new
> version to come. However, nothing has changed, so I spent some
> time today investigating the sources and found some interesting things.
>
>
>
> Just to recall the actual problem:
>
> Assume Sablotron is compiled with libiconv on a Linux machine (I was
> testing 0.44, because 0.5 doesn't compile at all). After setting XML
> file encoding to "windows-1250", setting XSL file encoding to
> "windows-1250" and adding an <xsl:output> element with encoding
> "windows-1250", the following happens: the result has an HTML meta tag
> with encoding set to "windows-1250" but is actually encoded in UTF-8. Of
> course, browsers trust the META tag and the page looks very bizarre.
>
>
>
> Sablotron checks whether the encoding is valid using libiconv functions,
> but it does not use encodings that are not listed in the iconv_encoding
> array (utf8.cpp). The function OutputDefinition::getEncoding() defaults to
> UTF-8, and when the encoding is not found in the static table,
> utf8Recode() is not called at all.
>
>
>
> I have added a new record to the table manually and changed some other
> functions so that it works now, but it's not the best thing we can do.
>
>
>
> Conclusions:
>
> 1. Encodings in Sablotron are handled differently in different
> functions. This is caused by improper abstraction of the encoding
> functions. If all functions, constants and other stuff around encoding
> were placed in one file and one class were used for all transformations,
> it would be much easier to handle errors and add new encodings and
> translation libraries.
>
> 2. Libiconv's ability to translate dozens of encodings is not used.
> The number of encodings is limited to those specified in the constant table.
>
> 4. I understand that such a large project requires lots of other
> things to do and the problem I describe is not the most important. If
> I get some time next year (I mean 2001), I might try to rearrange
> encoding handling and offer you a more complete suggestion how to do it.
>
> 5. I saw many pieces of code where encodings different from UTF-8 were
> treated as exceptions. That means that after processing every entity, we
> check many times if the encoding used is the "one we like or the one
> we dislike". It decreases performance and adds very complicated C++
> block trees to many functions. To make things function predictably we
> could separate these into special classes.
>
>
>
>
>
> Please don't consider this to be a bug description. It's just an idea
> of how to bring things closer to perfection.
>
>
>
> Sincerely Yours
>
>
>
> Koscheev Andrey
>
> Eller s.r.o.
>