Folks,

I've modified the parser to look for character set information and add it to the Plucker DB, in the manner that we discussed a few weeks ago. If a default character set is detected, the parser creates a record of a new kind (METADATA) and writes the IANA MIBenum for that charset in the new record. If some pages in the document use character sets other than the default, it builds a list of them and writes that list to the METADATA record as well. If no default character set is specified, none is assumed. So far I've only got the code in there to figure out the default; I'm still working on the code to figure out the charset for individual pages.
Figuring out the default:

If a default is specified on the command line, via '--charset=FOO', it is used as the default. (FOO can be either a charset name from a small set, or a decimal number from the set enumerated at http://www.iana.org/assignments/character-sets.) Failing that, the "default_charset" tag in the config file will set it. If neither of these is set, the parser uses the Python 'locale' module to check for a default, via

    locale.setlocale(locale.LC_ALL, "")
    encoding = locale.getlocale()[1]

If that also fails to produce a charset, the default charset is left unassigned. (On the Solaris 2.6 machine I use in California, the normal result is that there is no default charset.) Note that on POSIX machines, the locale setting can be manipulated via the LANG environment variable. I'm uncertain what to do on Windows machines; I'm experimenting with one right now to see what kind of possibilities there are.

The new METADATA record type:

I figured that it would be generally useful to have a record type that could be used extensibly to store small amounts of info (like a two-byte charset indicator), so the new record type holds a sequence of name-value pairs, where the 'name' is a 2-byte code and the value is a counted sequence of bytes. This is documented in the format doc, at http://www.plkr.org/index.pl/cvs/docs/DBFormat.html?rev=HEAD.

The default is to have only one of these records per document, and it uses record ID 5. There's a pointer to it in the index record's 'reserved record' list. The viewer doesn't yet know to do anything with it.

The intent is that other small amounts of data could also be put in this record in the future, as needed, without harm: things like the version of the parser that generated the document, or an ISBN for the document.

Bill
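To illustrate the lookup order described under "Figuring out the default", here is a minimal sketch in Python. It is not the parser's actual code. The function names, the NAME_TO_MIBENUM table, and the injectable locale_lookup parameter are all made up for the example; the parser's real "small set" of charset names is not listed in this message, so the three entries below are only stand-ins (their MIBenum values come from the IANA registry). The sketch also assumes that an unrecognized name simply falls through to the next source, which may not match what the parser does.

```python
import locale

# Hypothetical name table for illustration only; the parser's real set of
# accepted names is not given here. Values are IANA MIBenum numbers.
NAME_TO_MIBENUM = {
    "us-ascii": 3,
    "iso-8859-1": 4,
    "utf-8": 106,
}


def charset_to_mibenum(value):
    """Map a charset name or a decimal MIBenum string to an int, else None."""
    if value is None:
        return None
    text = str(value).strip()
    if text.isdecimal():
        return int(text)
    return NAME_TO_MIBENUM.get(text.lower())


def locale_charset():
    """Ask the 'locale' module for an encoding name; None if unavailable."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None
    return locale.getlocale()[1]


def default_charset(cmdline_value=None, config_value=None,
                    locale_lookup=locale_charset):
    """Try the command line, then the config file, then the locale.

    Returns a MIBenum, or None when no default can be determined
    (in which case no default charset is assumed).
    """
    for candidate in (cmdline_value, config_value):
        mibenum = charset_to_mibenum(candidate)
        if mibenum is not None:
            return mibenum
    return charset_to_mibenum(locale_lookup())
```

For example, default_charset("106", "iso-8859-1") picks the command-line value, while default_charset(None, None) depends entirely on the machine's locale settings and can legitimately return None.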

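The name-value layout of the METADATA record can be pictured with a similar sketch. The byte layout here is an assumption made for the example: a 2-byte big-endian name code, a 2-byte big-endian byte count, then the value bytes. The authoritative layout is whatever the format doc says, and it may differ. NAME_DEFAULT_CHARSET and both function names are likewise hypothetical.

```python
import struct

# Hypothetical 2-byte name code; the real codes are defined in the format doc.
NAME_DEFAULT_CHARSET = 1


def pack_pairs(pairs):
    """Serialize (code, value_bytes) pairs as code, byte count, value."""
    chunks = []
    for code, value in pairs:
        chunks.append(struct.pack(">HH", code, len(value)))
        chunks.append(value)
    return b"".join(chunks)


def unpack_pairs(data):
    """Parse the assumed layout back into a list of (code, value_bytes)."""
    pairs = []
    offset = 0
    while offset < len(data):
        code, length = struct.unpack_from(">HH", data, offset)
        offset += 4
        pairs.append((code, data[offset:offset + length]))
        offset += length
    return pairs


# A record holding only the default charset, stored as a two-byte MIBenum
# (106 is the IANA MIBenum for UTF-8).
record = pack_pairs([(NAME_DEFAULT_CHARSET, struct.pack(">H", 106))])
```

Because every value carries its own length, a reader can step over name codes it does not recognize, which is what would let later additions (parser version, ISBN, and so on) go into the same record without breaking older readers.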