2008/3/28, Jean Christophe André <jean-christophe.andre at auf.org>:
> Nguyen Vu Hung wrote:
>
> > The users have 2 choices:
> > 1. Convert TCVN3 encoded MS .doc file into UTF-8 encoded .odt
> > 2. Convert an .odt with TCVN3 encoding into UTF-8 encoded .odt.
> >
>
> No. On my side, they have only one choice: convert an ODF document.
>
> This tool doesn't care (because I don't want to) about MS formats.

In the first POC[1], the OP assumes that the input is a .doc file. This
.doc file is first opened with OOo and saved as .odt; after that, we
convert the .odt file (TCVN encoded) to UTF-8.
.doc -> UTF-8 .odt conversion helps when you want to convert thousands of
files at a time. See win32com for how to open a .doc file and save it as
.odt.

> It's a political choice: I will definitely help Vietnam to move to OOo
> and Unicode by the way, but I will *not* help keeping using Microsoft
> formats.

If one wants to keep the MS format, your tool won't be necessary :).
I repeat: your tool should be able to convert TCVN3-encoded documents to
.odt (and, optionally, TCVN3-encoded .rtf to .odt).
# Conversion from .doc to .rtf is done easily with win32com.

> I'm not sure what kind of auto-detection you are talking about...
>
> Auto-detection is not always doable, depending on the kind of difference
> between the encodings, or it may become truly hard if the encodings use
> the same coding rules (like TCVN-5712-1 and ISO-8859-1, using raw 8 bits
> to code 256 characters) => you'll have to guess the encoding using
> pattern recognition (eg: words from a dictionary).
>

I meant that we need to ensure that if a string is claimed to be TCVN3, it
really is TCVN3 (not VNI or another encoding). This is a rare case and,
IMO, it hardly ever happens. Given a string x, is there any API in Python
to "guess" what its encoding is?[2]

> But I think we need that here... Rules are simple: ".VnTimes" uses
> TCVN-5712 encoding (declaring it wrongly as CP1252), "VNITimes" uses VNI
> encoding and "Times New Roman" uses Unicode encoding. So it's quite easy
> to recognize the "encoding" here.

I see. So the chance we get a false positive with encoding detection is
low?

[1] http://www.hanoilug.org/dokuwiki/projects:ovniconv
[2] I am not a Python expert, but I see this feature in Perl's
    Encode::Guess http://perldoc.perl.org/Encode/Guess.html

-- 
Best Regards,
Nguyen Hung Vu ( Nguyễn Vũ Hưng )
vuhung16plus{[email protected]
An inquisitive look at Harajuku
http://www.flickr.com/photos/vuhung/sets/72157600109218238/
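
To make the font-name rule quoted above concrete, here is a minimal Python
sketch. It is only an illustration, not the actual tool: the function names
are hypothetical, and the prefix rule (".Vn..." means TCVN3, "VNI..." means
VNI, anything else is treated as Unicode) is my own generalisation of the
three font names given in the quote. It relies on an .odt file being a zip
archive whose content.xml declares its fonts in <style:font-face> elements.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# ODF "style" namespace, used by <style:font-face style:name="...">.
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def encoding_for_font(font_name):
    """Guess the text encoding from a font name (assumed prefix rule)."""
    name = font_name.strip()
    if name.startswith(".Vn"):            # e.g. ".VnTimes" -> TCVN3
        return "TCVN3"
    if name.upper().startswith("VNI"):    # e.g. "VNITimes" -> VNI
        return "VNI"
    return "Unicode"                      # e.g. "Times New Roman"


def declared_fonts(odt_file):
    """Return the font names declared in an .odt file's content.xml."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    tag = "{%s}font-face" % STYLE_NS
    attr = "{%s}name" % STYLE_NS
    return [el.get(attr) for el in root.iter(tag) if el.get(attr)]


def classify_fonts(odt_file):
    """Map each declared font name to the encoding the rule assigns it."""
    return {font: encoding_for_font(font) for font in declared_fonts(odt_file)}
```

classify_fonts() accepts a path or an open binary file. It only tells which
fonts would need conversion; the TCVN3/VNI to Unicode mapping tables
themselves are not shown here.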
