Nguyen Vu Hung wrote:
> In the first POC[1], the OP assumes that the input is a .doc file.

The first step was only to show how to go from .DOC to .ODT.
The original project name was OOoVUniConv => OpenOffice.org's Vietnamese to Unicode conversion. I've since changed it to "ovniconv" => ODF's Vietnamese to Unicode conversion.

> This .doc file will first be opened with OOo, saved as .odt and after that, we
> convert the .odt file (TCVN encoded) to UTF-8.

Yep. That's what I show in my first example. But you are right: I should remove this step. ;-)

> .doc -> UTF-8 .odt conversion helps when you want to convert thousands of files
> at a time. See win32com on how to open a .doc file and save it as .odt.

Yep, but conversion from .DOC to ODF is a headache already managed by OOo, and I think it should not be redone in the Python script, except by calling some OOo library to do the job. Let's not reinvent the wheel.

> If one wants to keep the MS format, your tool won't be necessary :).

Exactly. Because this tool is about coming to the Open World the right way (using ODF and Unicode). It's certainly not about converting from Vietnamese to Unicode while keeping the Microsoft format all along.

> I repeat: your tool should be able to convert TCVN3-ed to odt (and,
> optionally, TCVN3-ed .rtf to odt).

I'm sorry, but I think it shouldn't. I understand you would like to see a very useful general conversion tool, but I think we should set a more reasonable goal. E.g., parsing RTF is another matter entirely; it's not as easy as parsing XML. I always try to follow the Unix philosophy: make light tools that do only one small thing, but do it as perfectly as possible!

> # conversion from .doc to .rtf is done easily with win32com.

But parsing RTF will then be a headache. And I think there will be more loss in DOC->RTF conversion than in DOC->ODF conversion, simply because I think RTF cannot hold all the information stored in a .DOC (but maybe I'm wrong, I never checked this).

> I did mean we need to ensure that if a string is claimed to be TCVN3, it must
> be TCVN3 (not VNI or other encodings).
> This is a rare case and, IMO, it
> hardly happens.

Once again, you are talking about a rare case that should be handled manually, IMHO. I want to make a very light, very simple tool that "just works" for the general case. We'll deal with the more specific ones later. Or maybe there could be some front-end tool that does that checking and corrects the bad encoding before the recoding. It might then help if I restructure my Python code to allow a free choice of encoding at conversion call. I'll look at that when I come back next week.

> Given a string x, is there any API in Python to "guess" what its encoding is?[2]

I don't know (yet)... This Python script is my second Python try! :-D I have years of experience in other languages, but I'm just starting to learn this one. ;-)

> I see. So the chance we get a false positive with encoding detection is low?

If your document can be read correctly using the old Vietnamese fonts, then there is no chance (never say never?) of getting a bad encoding issue, IMHO.

--
Jean Christophe ANDRÉ, Regional Technical Manager
Bureau Asie-Pacifique (BAP): http://asie-pacifique.auf.org/
Agence universitaire de la Francophonie (AuF): http://www.auf.org/
Postal address: AUF, 21 Lê Thánh Tông, T.T. Hoàn Kiếm, Hà Nội, Việt Nam
Tel.: +84 4 9331108  Fax: +84 4 8247383  Mobile: +84 91 3248747
Personal note: please avoid sending me PowerPoint or Word files,
see http://www.gnu.org/philosophy/no-word-attachments.fr.html
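
As a rough illustration of the recoding step discussed in this thread, and of letting the caller choose the encoding table at conversion call, here is a minimal Python sketch. An .odt file is a ZIP archive whose body text lives in content.xml, so the sketch works on an XML string only and leaves the packaging aside. The names recode_text, recode_xml and TCVN3_SAMPLE are hypothetical and not taken from ovniconv, and the three-entry table is only a sample that should be checked against the TCVN3 specification. The sketch also recodes every text node, whereas a real tool would presumably restrict itself to text set in a TCVN3 font.

```python
import xml.etree.ElementTree as ET

# Partial, illustrative TCVN3 (ABC) table: legacy character -> Unicode.
# Only three entries are listed here (assumption: enough for the demo);
# a real table must cover the whole character set.
TCVN3_SAMPLE = {"µ": "à", "é": "ộ", "Ö": "ệ"}


def recode_text(text, mapping):
    """Replace each legacy character found in `mapping`; keep the rest as-is."""
    return text.translate(str.maketrans(mapping))


def recode_xml(xml_source, mapping=TCVN3_SAMPLE):
    """Recode every text node of an XML document, leaving the markup untouched.

    `mapping` is a plain dict, so another legacy encoding (VNI, for
    instance) could be plugged in by passing a different table.
    """
    root = ET.fromstring(xml_source)
    for element in root.iter():
        if element.text:
            element.text = recode_text(element.text, mapping)
        if element.tail:
            element.tail = recode_text(element.tail, mapping)
    return root


# A tiny stand-in for a paragraph taken from content.xml.
sample = (
    '<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    "Hµ Néi, <text:span>ViÖt Nam</text:span></text:p>"
)
converted = recode_xml(sample)
print("".join(converted.itertext()))  # Hà Nội, Việt Nam
```

Passing the table as an argument keeps the tool small: the XML walk stays the same and only the dict changes per encoding. Checking whether a text really is TCVN3 would remain a separate step, for example in a front-end tool as suggested above.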
