[HanoiLUG] Project OooVUniConv

Jean Christophe André Fri, 28 Mar 2008 17:28:33 +0700

Nguyen Vu Hung wrote:
> In the first POC[1], the OP assumes that the input is a .doc file.
>   
The first step was only to show how to go from .DOC to .ODT.


The original project name was OOoVUniConv => OpenOffice.org's Vietnamese
to Unicode conversion.

I then changed it to "ovniconv" => ODF's Vietnamese to Unicode
conversion.

> This .doc file will first be opened with OOo and saved as .odt; after that, we
> convert the .odt file (TCVN encoded) to UTF-8.
>   
Yep. That's what I show in my first example. But you are right: I should
remove this step. ;-)

> .doc -> UTF-8 .odt conversion helps when you want to convert thousands of files
> at a time. See win32com on how to open a .doc file and save it as .odt.
>   
Yep, but conversion from .DOC to ODF is a headache already managed by
OOo and I think it should not be re-done in the Python script except
using a call to some OOo library to do the job. Let's not re-invent the
wheel.
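
A minimal sketch of delegating the .DOC to ODF step to the office suite from Python. It assumes a recent LibreOffice-style `soffice` binary in PATH that accepts `--headless`, `--convert-to` and `--outdir`; older OpenOffice.org releases may not accept these options. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, office_bin="soffice"):
    """Return the argument list asking the office suite to save doc_path as .odt."""
    return [
        office_bin,
        "--headless",            # no GUI
        "--convert-to", "odt",   # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_odt(doc_path, out_dir, office_bin="soffice"):
    """Run the conversion and return the expected path of the .odt file."""
    if shutil.which(office_bin) is None:
        raise FileNotFoundError(f"{office_bin!r} not found in PATH")
    subprocess.run(build_convert_command(doc_path, out_dir, office_bin), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".odt")
```

Keeping the command construction separate from its execution lets the recoding script stay unaware of how the .odt file was produced.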

> If one wants to keep MS format, your tool won't be necessary :).
>   
Exactly.

Because this tool is about coming to the Open World the right way (using
ODF and Unicode).

It's certainly not about converting from Vietnamese to Unicode keeping
the Microsoft format all along.

> I repeat: your tool should be able to convert TCVN3-encoded .doc to .odt (and,
> optionally, TCVN3-encoded .rtf to .odt).
>   
I'm sorry, but I think it shouldn't.

I understand you would like to see a very useful general conversion
tool, but I think we should set a more reasonable goal. E.g., parsing
RTF is another matter entirely; it's not as easy as parsing XML.

I always try to follow the Unix philosophy: build light tools that do
only one small thing, but do it as perfectly as possible!
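
To illustrate why the XML route is the easy one: an .odt file is a ZIP archive whose text lives in `content.xml`, so the standard library is enough to walk it. This is a sketch only, not the actual ovniconv code. The function name is hypothetical, and it recodes every text node, whereas a real tool would presumably touch only the text set in legacy Vietnamese fonts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def recode_odt_content(odt_bytes, convert):
    """Return new ODT bytes with convert() applied to every text node of content.xml."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                root = ET.fromstring(data)
                for node in root.iter():
                    # Text directly inside the element, then text following it.
                    if node.text:
                        node.text = convert(node.text)
                    if node.tail:
                        node.tail = convert(node.tail)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Reusing the original entry keeps member order and compression type.
            dst.writestr(item, data)
    return out_buf.getvalue()
```

Note that ElementTree rewrites namespace prefixes on output (ns0, ns1, ...); the result is still equivalent XML, but a production tool may want to preserve the original prefixes.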

> # conversion from .doc to .rtf is done easily with win32com.
>   
But parsing RTF will then be a headache. And I think there will be more
loss in DOC->RTF conversion than in DOC->ODF conversion, simply because I
think RTF cannot hold all the information stored in a .DOC (but maybe
I'm wrong, I never checked this).

> I did mean we need to ensure that if a string is claimed to be TCVN3, it must
> be TCVN3 (not VNI or another encoding). This is a rare case and, IMO, it
> hardly ever happens.
>   
Once again, you are talking about a rare case that should be handled
manually, IMHO. I want to make a very light, very simple tool that
"Just Works" for the most common case. We'll deal with the more specific
ones later.

Or maybe there could be some front-end tool that does that checking and
corrects the bad encoding before the recoding is done. It may then help if
I restructure my Python code to allow a free choice of encoding at the
conversion call. I'll look at that when I come back next week.
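
A possible shape for such a restructuring, as a sketch under stated assumptions: the source encoding is picked by name at call time, and a small check reports characters the chosen table does not cover. All names here are hypothetical, and the table is a tiny illustrative excerpt, not a complete or verified TCVN3 mapping.

```python
# Tiny illustrative excerpt only; a real table covers the whole font encoding
# and must be checked against the actual legacy fonts.
TCVN3_EXCERPT = {"µ": "à", "®": "đ", "Ö": "ệ", "é": "ộ"}

# Registry of known source encodings (names are hypothetical).
TABLES = {"tcvn3": TCVN3_EXCERPT}


def to_unicode(text, source="tcvn3", tables=TABLES):
    """Recode text from the named legacy font encoding to real Unicode."""
    try:
        table = tables[source]
    except KeyError:
        raise ValueError(f"unknown source encoding: {source!r}") from None
    return text.translate(str.maketrans(table))


def uncovered_chars(text, source="tcvn3", tables=TABLES):
    """Return the non-ASCII characters the chosen table does not know about."""
    table = tables[source]
    return sorted({ch for ch in text if ord(ch) > 127 and ch not in table})
```

With this layout, supporting another legacy encoding means adding one more table to the registry, and a front-end check can call `uncovered_chars()` before any recoding is attempted.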

> Given a string x, is there any API in Python to "guess" what its encoding is?[2]
>   
I don't know (yet)... This Python script is my second attempt at Python! :-D
I have years of experience in other languages, but I'm just starting to
learn this one. ;-)
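
For what it is worth, the Python standard library has no general encoding guesser. With font-based encodings the text is already valid Unicode (just the wrong characters), so one crude fallback is to count characters typical of each candidate. The sketch below assumes the caller supplies the marker sets; the function name is hypothetical and the sets in the usage example are illustrative fragments, not verified tables.

```python
def guess_encoding(text, markers):
    """Return the candidate whose marker characters occur most often, or None.

    markers maps a candidate name to a set of characters typical of it.
    On a tie, the candidate listed first wins.
    """
    scores = {
        name: sum(1 for ch in text if ch in chars)
        for name, chars in markers.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None  # nothing matched: plain ASCII or an unknown encoding
    return best


# Hypothetical, incomplete marker sets for illustration.
MARKERS = {"tcvn3": set("µ®Öé"), "vni": set("äø")}
print(guess_encoding("ViÖt Nam", MARKERS))
```

Real legacy tables overlap, so a score like this can only suggest a candidate; it cannot prove that a string is in a given encoding.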

> I see. So the chance we get a false-positive with encoding detection is low?
>   
If your document can be read correctly using the old Vietnamese fonts,
then there is no chance (never say never?) of running into a bad encoding,
IMHO.

-- 
Jean Christophe ANDRÉ, Regional Technical Manager
Bureau Asie-Pacifique (BAP), http://asie-pacifique.auf.org/
Agence universitaire de la Francophonie (AuF), http://www.auf.org/
Postal address: AUF, 21 Lê Thánh Tông, T.T. Hoàn Kiếm, Hà Nội, Việt Nam
Tel.: +84 4 9331108   Fax: +84 4 8247383   Mobile: +84 91 3248747
Personal note: please avoid sending me PowerPoint or Word files,
see http://www.gnu.org/philosophy/no-word-attachments.fr.html


