It is a good idea. But what happens when several charsets are mixed within a single page of the document? That is not something you can always avoid.
How about simply allowing <cs></cs> tags within the encoded pages? That way, we could display multi-language documents without any problem.

Zailong

--- Bill Janssen <[EMAIL PROTECTED]> wrote:
> Folks,
>
> I've modified the parser to look for character set information, and
> add it to the Plucker DB, in the manner that we discussed a few weeks
> ago. If a default character set is detected, it creates a record of a
> new kind (METADATA), and writes the IANA MIBenum for that charset in
> the new record. If various pages in the document use character sets
> other than the default, it makes up a list of them, and writes that
> list to the METADATA record. If no default character set is
> specified, none is assumed. I've currently only got the code in there
> to figure out the default; I'm working on the code to figure out the
> charset for individual pages.
>
> Figuring out the default:
>
> If a default is specified on the command line, via '--charset=FOO',
> it is used as the default. (FOO can be either a charset name from a
> small set, or a decimal number from the set enumerated at
> http://www.iana.org/assignments/character-sets.) Failing that, the
> "default_charset" tag in the config file will set it. If neither of
> these is set, the parser uses the Python 'locale' module to check for
> a default, via
>
>     locale.setlocale(locale.LC_ALL, "")
>     encoding = locale.getlocale()[1]
>
> If that also fails to produce a charset, the default charset is left
> unassigned. (Using a Solaris 2.6 machine in California, the standard
> operation for me is to have no default charset.) Note that on POSIX
> machines, the locale setting can be manipulated via the LANG
> environment variable.
>
> I'm uncertain of what to do on Windows machines; I'm experimenting with
> one right now to see what kind of possibilities there are.
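To make the lookup order described above concrete, here is a rough sketch of it as I read it: command line first, then the config file, then the locale. This is not the parser's actual code. The function names, the tiny name table, and the choice to skip an unrecognized value and fall through to the next source are all my own assumptions; only the MIBenum numbers come from the IANA registry.

```python
import locale

# Small illustrative name-to-MIBenum table (numbers from the IANA registry).
# The "small set" of names the real parser accepts is not shown in this mail.
MIBENUM = {"us-ascii": 3, "iso-8859-1": 4, "utf-8": 106}


def parse_charset(spec):
    """Return a MIBenum for a charset name or decimal string, else None."""
    if spec is None:
        return None
    spec = spec.strip()
    if spec.isdigit():
        return int(spec)
    return MIBENUM.get(spec.lower())


def locale_charset():
    """Ask the locale module for an encoding name, as in the quoted snippet."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None
    return locale.getlocale()[1]


def default_charset(cli_value=None, config_value=None, locale_lookup=locale_charset):
    """Resolve the default: command line, then config file, then locale.

    Assumption: a value that cannot be parsed is skipped rather than
    treated as an error. Returns None when no source yields a charset.
    """
    for candidate in (cli_value, config_value):
        mib = parse_charset(candidate)
        if mib is not None:
            return mib
    return parse_charset(locale_lookup())
```

For example, default_charset("utf-8", "4") picks the command-line value, and default_charset(None, None, lambda: None) leaves the default unassigned, which matches the Solaris case Bill mentions.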
>
> The new METADATA record type:
>
> I figured that it would be generally useful to have a record type
> that could be extensibly used to store small amounts of info (like a
> two-byte charset indicator), so the new record type holds a sequence
> of name-value pairs, where the 'name' is a 2-byte code, and the value
> is a counted sequence of bytes. This is documented in the format doc,
> at http://www.plkr.org/index.pl/cvs/docs/DBFormat.html?rev=HEAD.
>
> The default is to have only one of these records per document, and it
> uses record ID 5. There's a pointer to it in the index record's
> 'reserved record' list. The viewer doesn't yet know to do anything
> with it.
>
> The intent is that other small amounts of data could also be put in
> this record in the future without harm, as needed. Things like the
> version of the parser that generated the document, or an ISBN for the
> document, etc.
>
> Bill
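For anyone trying to picture the METADATA record, here is a minimal sketch of a sequence of name-value pairs with a 2-byte name code and a counted byte value. The exact layout is whatever the format doc says. The big-endian byte order, the 2-byte length field counted in bytes, and the name code NAME_DEFAULT_CHARSET = 1 are assumptions of mine for illustration only.

```python
import struct

# Hypothetical name code; the real codes are defined in the format doc.
NAME_DEFAULT_CHARSET = 1


def pack_pairs(pairs):
    """Serialize (code, value_bytes) pairs.

    Assumed layout per pair: 2-byte code, 2-byte byte count, then the value,
    all big-endian.
    """
    out = bytearray()
    for code, value in pairs:
        out += struct.pack(">HH", code, len(value))
        out += value
    return bytes(out)


def unpack_pairs(data):
    """Parse a buffer written by pack_pairs back into (code, value) pairs."""
    pairs = []
    offset = 0
    while offset < len(data):
        code, length = struct.unpack_from(">HH", data, offset)
        offset += 4
        pairs.append((code, data[offset:offset + length]))
        offset += length
    return pairs


# A two-byte charset indicator: 106 is the IANA MIBenum for UTF-8.
record = pack_pairs([(NAME_DEFAULT_CHARSET, struct.pack(">H", 106))])
```

Because every value carries its own length, a viewer that does not recognize a name code can skip over that pair, which seems to be what makes adding a parser version or an ISBN later harmless.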
