Janine Sisk wrote:

I'm working with a site that stores its content in big5, which is run through a conversion program to create a gb2312 version for those who prefer the simplified characters. I know these are the charsets being used; I've seen the config files for the converter. Unfortunately the converter was written by a Chinese company with no English info available, does not appear in Google, and is no longer supported even by the original authors. So basically I have to write my own program to do what it does, without any info on how it does what it does.

I haven't dealt with Chinese characters at all, but this sounds like you're doing a character set translation, not a character encoding conversion. Tcl's 'encoding' command won't help you here; you'd need a monster "string map" command to change all 6000 or so code points from one into the other. To draw a much simplified analogy, this is like translating cp1252 to iso8859-1: you can't do it by simply changing the encoding, you must translate from one character set to the other by mapping the characters that do not appear in the target character set (in the case of cp1252->iso8859-1 you might map both the left and right single quotes to an apostrophe).
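To make the "monster map" idea concrete, here is a minimal sketch in Java (since the existing converter is Java). It assumes the big5 bytes have already been decoded to Unicode text and that the result gets encoded to gb2312 afterwards; neither step is shown. The class name, the method name and the four-entry table are all hypothetical, and a real table would need thousands of traditional/simplified pairs from a proper mapping source.

```java
import java.util.HashMap;
import java.util.Map;

final class TraditionalToSimplified {
    // Hypothetical four-entry table for illustration only.
    // Keys are traditional code points, values the simplified counterparts.
    private static final Map<Integer, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put(0x570B, 0x56FD); // "country"
        TABLE.put(0x5B78, 0x5B66); // "study"
        TABLE.put(0x6F22, 0x6C49); // "Han"
        TABLE.put(0x8A9E, 0x8BED); // "language"
    }

    // Replace every code point that has a table entry; pass the rest through.
    static String translate(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(TABLE.getOrDefault(cp, cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Translates U+6F22 U+8A9E to U+6C49 U+8BED.
        System.out.println(translate("\u6F22\u8A9E"));
    }
}
```

The same table could be fed to Tcl's "string map" as a flat list of pairs; the point is only that the work is a per-character lookup, not an encoding switch.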


The only conversion that works with the Java program is to go utf-8 to utf-8s, which it calls simplified utf-8. Google tells me that this is a bastardized format of sorts, proposed by Oracle and not widely accepted. Unfortunately it is, so far, the only one that works. Data comes in as utf-8, gets converted to utf-8s, and goes out through AOLserver configured to use utf-8. All is well.

I think simplified utf-8 is the same as regular utf-8 for all code points < U+10000 (i.e., anything that fits in a single 16-bit UTF-16 code unit, which is Java's native string format). So if your characters are all beneath that you can call it utf-8 without issue.
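A quick way to check that claim, assuming utf-8s behaves like the CESU-8 style of encoding (each UTF-16 code unit encoded separately, so a surrogate pair becomes two 3-byte sequences instead of one 4-byte sequence). That is my understanding of the Oracle proposal, not something I have verified against the converter. The class and method names below are made up for the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8sSketch {
    // Encode every UTF-16 code unit on its own, so each half of a
    // surrogate pair is written as a separate 3-byte sequence.
    static byte[] encodeUnitsSeparately(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int u = s.charAt(i);
            if (u < 0x80) {
                out.write(u);
            } else if (u < 0x800) {
                out.write(0xC0 | (u >> 6));
                out.write(0x80 | (u & 0x3F));
            } else {
                out.write(0xE0 | (u >> 12));
                out.write(0x80 | ((u >> 6) & 0x3F));
                out.write(0x80 | (u & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // True when the sketch encoding and standard UTF-8 give identical bytes.
    static boolean sameAsUtf8(String s) {
        return Arrays.equals(encodeUnitsSeparately(s),
                             s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(sameAsUtf8("\u6C49\u8BED abc")); // all below U+10000
        System.out.println(sameAsUtf8("\uD840\uDC00"));     // U+20000, above it
    }
}
```

For text that stays below U+10000 the two byte sequences match, which is why the output can be served as plain utf-8. They only diverge once a character needs a surrogate pair.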

The problem is, Tcl doesn't support utf-8s, and as far as I can tell there is no other format that will work. This will leave me stuck with the Java program, and I have serious concerns about the performance of any sort of exec, let alone one that involves writing files.

It sounds like the Java program is your best bet since it does the translation already; do you have the source for it? You might be able to modify it to run better in a pipe, or to run as a persistent process so you avoid the fork/exec overhead on every request (e.g., by running it inside Tomcat as someone else suggested). If you're really adventurous you could try getting it to run under tcljava, but I have no idea if that even works inside AOLserver.
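For the persistent-process idea, a rough sketch of what the wrapper could look like, assuming you can call the converter as a Java function. The one-line-in, one-line-out protocol, the class name and the identity placeholder for the conversion are all my own assumptions; the real converter's entry point is unknown to me.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

final class ConverterLoop {
    // Read one request per line, convert it, write one reply line, and
    // flush right away so the caller on the other end of the pipe is not
    // left waiting on a buffer. Returns when the input reaches end of file.
    static void serve(Reader in, Writer out, UnaryOperator<String> convert) {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.print(convert.apply(line));
                writer.print('\n');
                writer.flush();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder conversion: the real converter call would go here.
        serve(new InputStreamReader(System.in, StandardCharsets.UTF_8),
              new OutputStreamWriter(System.out, StandardCharsets.UTF_8),
              line -> line);
    }
}
```

The JVM is then started once and kept open as a read/write pipe, so each request costs a write and a read rather than a fork/exec plus temp files. Content containing newlines would need some framing or escaping on top of this.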

-J


--
AOLserver - http://www.aolserver.com/
