2008/3/28, Jean Christophe André <jean-christophe.andre at auf.org>:
> Nguyen Vu Hung wrote:
>
> > The users have 2 choices:
>  > 1. Convert TCVN3 encoded MS .doc file into UTF-8 encoded .odt
>  > 2. Convert an .odt with TCVN3 encoding into UTF-8 encoded .odt.
>  >
>
> No. On my side, they have only one choice: convert an ODF document.
>
>  This tool doesn't care (because I don't want to) about MS formats.
In the first POC[1], the OP assumes that the input is a .doc file.
This .doc file is first opened with OOo and saved as .odt; after that,
we convert the .odt file (TCVN3-encoded) to UTF-8.

Converting .doc directly to UTF-8 .odt helps when you want to convert thousands
of files at a time. See win32com for how to open a .doc file and save it as .odt.
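
A rough sketch of the batch step, for illustration only. Instead of win32com
it assumes a LibreOffice-style `soffice` binary that accepts `--headless` and
`--convert-to`; whether a given OOo install supports these options is an
assumption. The function names are hypothetical, and the resulting .odt files
are still TCVN3-encoded, so the encoding conversion remains a separate step.

```python
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build (but do not run) the command that converts one .doc to .odt."""
    return [
        soffice,
        "--headless",            # no GUI
        "--convert-to", "odt",   # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def plan_batch(in_dir, out_dir):
    """Return one command per *.doc file found directly inside in_dir."""
    docs = sorted(Path(in_dir).glob("*.doc"))
    return [build_convert_command(doc, out_dir) for doc in docs]
```

Each command in the list could then be passed to `subprocess.run(cmd, check=True)`.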

>  It's a political choice: I will definitely help Vietnam to move to OOo
>  and Unicode by the way, but I will *not* help keeping using Microsoft
>  formats.
If one wants to keep the MS format, your tool won't be necessary :).

I repeat: your tool should be able to convert TCVN3-encoded documents to .odt
(and optionally TCVN3-encoded .rtf to .odt).

# Conversion from .doc to .rtf is easily done with win32com.

> I'm not sure what kind of auto-detection you are talking about...
>
>  Auto-detection is not always doable, depending on the kind of difference
>  between the encodings, or it may become truly hard if the encodings use
>  the same coding rules (like TCVN-5712-1 and ISO-8859-1, using raw 8 bits
>  to code 256 characters) => you'll have to guess the encoding using
>  pattern recognition (eg: words from a dictionary).
>
What I meant is that we need to ensure that if a string is claimed to be TCVN3,
it really is TCVN3 (not VNI or another encoding).
A mismatch is a rare case and, IMO, it hardly ever happens.

Given a string x, is there any API in Python to "guess" what its encoding is?[2]
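
As a naive illustration of what such a guess could look like, here is a
trial-decoding sketch. `guess_encodings` is a hypothetical helper, not an
existing Python API, and the candidate list is an assumption. As far as I
know the standard library ships no TCVN3 codec, so only stock codecs appear.

```python
def guess_encodings(data, candidates=("ascii", "utf-8", "cp1258", "latin-1")):
    """Return the candidate codecs that decode `data` without an error."""
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the byte sequence
        matches.append(name)
    return matches


# A raw 8-bit codec such as latin-1 accepts every byte sequence, so a
# successful decode only rules encodings out; it never proves a match.
sample = "Việt Nam".encode("utf-8")
print(guess_encodings(sample))
```

This only narrows the list down; for 8-bit encodings that share the same
coding rules, some kind of pattern recognition would still be needed.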

>  But I think we need that here... Rules are simple: ".VnTimes" uses
>  TCVN-5712 encoding (declaring it wrongly as CP1252), "VNITimes" uses VNI
>  encoding and "Times New Roman" uses Unicode encoding. So it's quite easy
>  to recognize the "encoding" here.
I see. So the chance we get a false-positive with encoding detection is low?
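
The font rules quoted above could be sketched roughly as follows. The
function name is hypothetical, and matching by the ".Vn" and "VNI" prefixes
is my assumption that generalizes the two font names given; only the three
listed fonts come from the rules themselves.

```python
# Fonts known to carry real Unicode text (only the one named in the rules).
UNICODE_FONTS = {"Times New Roman"}


def encoding_for_font(font_name):
    """Map a font name to the encoding its text most likely uses."""
    name = font_name.strip()
    if name.startswith(".Vn"):   # e.g. ".VnTimes" (assumed prefix rule)
        return "TCVN3"
    if name.startswith("VNI"):   # e.g. "VNITimes" (assumed prefix rule)
        return "VNI"
    if name in UNICODE_FONTS:
        return "Unicode"
    return None                  # unknown font: do not guess
```

Returning None for an unknown font keeps the converter from touching text it
cannot classify.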

[1] http://www.hanoilug.org/dokuwiki/projects:ovniconv
[2] I am not a Python expert, but I see this feature in Perl's Encode::Guess
      http://perldoc.perl.org/Encode/Guess.html

-- 
Best Regards,
Nguyen Hung Vu ( Nguy?n V? H?ng )
vuhung16plus{[email protected]
An inquisitive look at Harajuku
http://www.flickr.com/photos/vuhung/sets/72157600109218238/
