Folks,

  I've modified the parser to look for character set information, and
add it to the Plucker DB, in the manner that we discussed a few weeks
ago.  If a default character set is detected, it creates a record of a
new kind (METADATA), and writes the IANA MIBenum for that charset in
the new record.  If various pages in the document use character sets
other than the default, it makes up a list of them, and writes that
list to the METADATA record.  If no default character set is
specified, none is assumed.  I've currently only got the code in there
to figure out the default; I'm working on the code to figure out the
charset for individual pages.
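
A rough sketch of how the per-page list might be built once that code
exists. Everything here is an assumption for illustration: the function
name, the idea that each page is keyed by a record ID, and the choice to
store (record ID, MIBenum) pairs are not taken from the parser.

```python
def charset_exceptions(page_charsets, default):
    """Return sorted (page_id, mibenum) pairs for pages whose charset
    is known and differs from the document default.

    page_charsets: hypothetical dict of page record ID -> MIBenum or None.
    default: the document's default MIBenum, or None if unassigned.
    """
    return sorted(
        (page_id, mibenum)
        for page_id, mibenum in page_charsets.items()
        if mibenum is not None and mibenum != default
    )


# Pages 11 and 13 match the default (106, UTF-8), page 14 has no charset,
# so only page 12 (4, ISO-8859-1) ends up in the list.
exceptions = charset_exceptions({11: 106, 12: 4, 13: 106, 14: None}, 106)
```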

Figuring out the default:

  If a default is specified on the command line, via '--charset=FOO',
it is used as the default.  (FOO can be either a charset name from a
small set, or a decimal number from the set enumerated at
http://www.iana.org/assignments/character-sets.)  Failing that, the
"default_charset" tag in the config file will set it.  If neither of
these is set, the parser uses the Python 'locale' module to check for
a default, via

  locale.setlocale(locale.LC_ALL, "")
  encoding = locale.getlocale()[1]

If that also fails to produce a charset, the default charset is left
unassigned.  (Using a Solaris 2.6 machine in California, the standard
operation for me is to have no default charset.)  Note that on POSIX
machines, the locale setting can be manipulated via the LANG
environment variable.
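
The lookup order above could be sketched roughly as follows. This is
not the parser's actual code: the function names, the contents of the
name table, and the decision to fall through to the next source when a
value is not recognized are all assumptions made for illustration. The
MIBenum values shown are the ones listed in the IANA registry.

```python
import locale

# Illustrative subset only; the parser's real name table may differ.
KNOWN_CHARSETS = {
    "us-ascii": 3,
    "iso-8859-1": 4,
    "utf-8": 106,
    "windows-1252": 2252,
}


def to_mibenum(value):
    """Map a charset name or a decimal string to a MIBenum, else None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text.isdecimal():
        return int(text)
    return KNOWN_CHARSETS.get(text)


def charset_from_locale():
    """Ask the 'locale' module for an encoding name; None if unavailable."""
    try:
        locale.setlocale(locale.LC_ALL, "")
        return locale.getlocale()[1]
    except (locale.Error, ValueError):
        return None


def default_charset(cli_value=None, config_value=None,
                    locale_lookup=charset_from_locale):
    """Try the command line, then the config file, then the locale."""
    for candidate in (cli_value, config_value):
        mibenum = to_mibenum(candidate)
        if mibenum is not None:
            return mibenum
    # May still be None, in which case the default stays unassigned.
    return to_mibenum(locale_lookup())
```

Passing the locale lookup in as a parameter is only there so the order
of precedence can be exercised without touching the process locale.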

I'm uncertain of what to do on Windows machines; I'm experimenting with
one right now to see what kind of possibilities there are.

The new METADATA record type:

  I figured that it would be generally useful to have a record type
that could be extensibly used to store small amounts of info (like a
two-byte charset indicator), so the new record type holds a sequence
of name-value pairs, where the 'name' is a 2-byte code, and the value
is a counted sequence of bytes.  This is documented in the format doc,
at http://www.plkr.org/index.pl/cvs/docs/DBFormat.html?rev=HEAD.
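
To make the name-value idea concrete, here is a minimal sketch of
packing and unpacking such pairs. The byte layout is an assumption, not
the layout from the format doc: it uses a 2-byte big-endian name code,
a 2-byte big-endian byte count, and then the value bytes. The name code
NAME_DEFAULT_CHARSET = 1 is likewise hypothetical; the format doc is
the authority for the real codes and field sizes.

```python
import struct

# Hypothetical name code, for illustration only.
NAME_DEFAULT_CHARSET = 1


def pack_pairs(pairs):
    """Serialize (name_code, value_bytes) pairs into one byte string."""
    out = bytearray()
    for name, value in pairs:
        out += struct.pack(">HH", name, len(value))  # name code, byte count
        out += value
    return bytes(out)


def unpack_pairs(data):
    """Parse a byte string produced by pack_pairs back into pairs."""
    pairs = []
    offset = 0
    while offset < len(data):
        name, count = struct.unpack_from(">HH", data, offset)
        offset += 4
        value = data[offset:offset + count]
        if len(value) != count:
            raise ValueError("truncated value for name code %d" % name)
        pairs.append((name, value))
        offset += count
    return pairs


# A record holding only a default charset of UTF-8 (MIBenum 106),
# stored as a two-byte value.
record = pack_pairs([(NAME_DEFAULT_CHARSET, struct.pack(">H", 106))])
```

Because every value carries its own count, a reader that does not
recognize a name code can skip that pair and keep going, which is what
makes the record safe to extend later.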

The default is to have only one of these records per document, and it
uses record ID 5.  There's a pointer to it in the index record's
'reserved record' list.  The viewer doesn't yet know to do anything
with it.

The intent is that other small amounts of data could also be put in
this record in the future without harm, as needed.  Things like the
version of the parser that generated the document, or an ISBN for the
document, etc.

Bill
