I've implemented the PGN export extension because it is required for
writing proper PGN files. (Later this extension will also be used to write
C/CIF archives.) I think a more detailed description of the problem and
its solution will be helpful:

-------------------------------------------------------------------------------------------
The current database version 4.0 of Scid still has some weaknesses concerning
internationalization. The data (player names, site names, event names,
comments inside move data, etc.) might be stored with any character set
encoding, depending on the following situations:

1. Older Linux/Unix distributions were installed with Latin-1 as the default
encoding, and the strings have been stored in Latin-1 because older Tcl
libraries did not support Unicode.

2. Newer Linux/Unix distributions are installed with UTF-8 as the default
encoding, which means that all strings will be stored UTF-8 encoded.

3. Many applications (including Scid) have produced PGN files with unsuitable
character encodings. It is not uncommon for a PGN file to use extended ASCII
(CP850, for example), or to be UTF-8 encoded but without a leading UTF-8 BOM.
When importing PGN files, Scid interprets the content as being in the system
encoding, which in such cases may result in defective encodings. Often the
text content of these games cannot be displayed correctly.

4. In some older databases the data was stored as Latin-1, while newer games
are stored as UTF-8 because the Tcl library was upgraded. Such a database is
now a mix of different encodings.

5. The PGN import interprets Latin-1 as UTF-8, thus all data will be stored
Latin-1 encoded.

6. Older Windows versions stored CP1252-encoded data, but newer Windows
versions store UTF-8 (depending on the Tcl library version).
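
To illustrate how the situations listed above lead to unreadable text, here
is a minimal sketch in Python. Python is used only for illustration (it is
not what Scid is written in), and the helper name has_utf8_bom is
hypothetical. It shows a BOM check and what happens when bytes are decoded
with the wrong encoding:

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (bytes EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

# A player name with a non-ASCII character, in both encodings.
name = "Müller"
utf8_bytes = name.encode("utf-8")      # b'M\xc3\xbcller'
latin1_bytes = name.encode("latin-1")  # b'M\xfcller'

# UTF-8 bytes read as Latin-1: decoding never fails, but the text is garbled.
misread = utf8_bytes.decode("latin-1")
print(misread)  # MÃ¼ller

# Latin-1 bytes read as UTF-8: strict decoding fails on the lone 0xFC byte.
try:
    latin1_bytes.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
print(strict_utf8_ok)  # False
```

Without a BOM or other marker, a reader has no reliable way to know which of
the two byte sequences it is looking at, which is why a mixed database is
hard to display correctly.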

This has an impact on the export of PGN files. Quite often the written data
is not properly Latin-1 encoded, which is a violation of the PGN standard.
Moreover, the Latin-1 character set can be unsuitable: for example, Russian
comments exported to PGN with Latin-1 encoding will be unreadable, so in
this case an export to UTF-8 is required. (PGN files with UTF-8 encoding do
not conform to the PGN standard, but ChessBase has introduced UTF-8 encoding
with a leading UTF-8 BOM at the start of the file to mark the file content
as UTF-8. This is now a de facto standard; most modern chess applications
support this extension.)
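
The two export variants can be sketched as follows. This is only an
illustration in Python under my own assumptions, not the code used in
Scid vs PC; the function name encode_pgn and its parameters are
hypothetical:

```python
import codecs

def encode_pgn(text: str, use_utf8: bool) -> bytes:
    """Encode PGN text as UTF-8 with a leading BOM, or as plain Latin-1."""
    if use_utf8:
        # The BOM marks the file content as UTF-8 (the ChessBase convention).
        return codecs.BOM_UTF8 + text.encode("utf-8")
    # Raises UnicodeEncodeError if the text contains characters that do not
    # exist in Latin-1, for example Cyrillic letters.
    return text.encode("latin-1")

comment = "{Сильный ход}"  # a Russian comment
data = encode_pgn(comment, use_utf8=True)
print(data[:3])  # b'\xef\xbb\xbf'
```

Calling encode_pgn(comment, use_utf8=False) with the Russian comment raises
a UnicodeEncodeError, which is the case where UTF-8 export is required.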

The newer version of Scid vs PC has introduced some enhancements for a proper
PGN export:

1. The user can choose between Latin-1 and UTF-8 encoding. Latin-1 will in
general be preferred, but in some cases, for example when exporting Russian
content, Latin-1 is unsuitable and UTF-8 should be used instead.

2. The export uses a character set detector. This detector tries to detect
the character set of the exported text and converts the content to either
Latin-1 or UTF-8, depending on the user's choice. In many cases this
detector is even able to convert the result of defective encodings into a
proper character set.
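
To give an idea of how such a detector can work, here is a deliberately
simple heuristic in Python. It is not the detector used in Scid vs PC, which
may work quite differently; the function names are hypothetical, and the
fallback order (strict UTF-8, then CP1252, then Latin-1) is my assumption
for this sketch:

```python
def repair_double_encoding(text: str) -> str:
    """Undo one round of 'UTF-8 read as Latin-1' if that gives valid UTF-8."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not double encoded, keep as is

def decode_guess(data: bytes) -> str:
    """Guess the encoding of a stored string and return it as text."""
    try:
        # Strict UTF-8 rejects most Latin-1/CP1252 byte sequences.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        try:
            # Assumption: non-UTF-8 data is most likely CP1252 ...
            return data.decode("cp1252")
        except UnicodeDecodeError:
            # ... and Latin-1 accepts every byte, so this never fails.
            return data.decode("latin-1")
    return repair_double_encoding(text)

def convert(data: bytes, target: str) -> bytes:
    """Convert stored bytes to the user's choice, 'latin-1' or 'utf-8'."""
    return decode_guess(data).encode(target)

print(decode_guess(b"M\xfcller"))      # Latin-1 input -> Müller
print(decode_guess(b"M\xc3\xbcller"))  # UTF-8 input   -> Müller
```

A heuristic like this can guess wrong, for example on short strings or on
text that legitimately contains a sequence such as "Ã¼", so a real detector
needs more statistics than this sketch has.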

Please note that English-speaking countries are in general not affected by
these problems, because the English characters belong to the ASCII range,
which is encoded identically in Latin-1 and UTF-8. Nearly all of the rest of
the world is affected.
-------------------------------------------------------------------------------------------

Gregor
