On 28 Oct., 15:25, [EMAIL PROTECTED] wrote:
> All,
>
> I am trying to write a script that will parse and extract data from a
> MS Word document. Can / would anyone refer me to a tutorial on how to
> do that? (perhaps from tables). I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for word, so help for that would also be
> welcome.
>
> Any help would be appreciated. Thanks for your attention and
> patience.
>
> ::bp::

One can save an MS Word document as MHTML ("Single File Web Page"; if I remember correctly those files have an .mht extension). Strictly speaking this is a MIME-wrapped HTML file rather than XML proper, but the HTML carries a lot of Office-specific XML markup. The result is a huge amount of (nevertheless structured) markup gibberish together with the text. If one spends time and attention, one can find patterns in the markup (it is XML-like and human readable).

A few years ago I used this conversion to implement roughly the following algorithm:

1. I manually highlighted one or more sections in a Word doc using a background colour marker.
2. I searched for the colour-marked section and determined its structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were equally structured.
4. I applied an href link to the text that was surrounded by the structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.

This way I could link two documents (those were public specifications that were originally disconnected).

Kay
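
A minimal sketch of steps 1 and 2 of the algorithm described above, not the original implementation. It assumes the HTML part has already been extracted from the .mht file (which is MIME-encoded, so the standard email module should be able to decode it) and that Word records highlighting as an inline style containing "mso-highlight". That marker is an assumption worth checking against your own export. The names HighlightFinder and find_highlighted are made up for this illustration.

```python
from html.parser import HTMLParser


class HighlightFinder(HTMLParser):
    """Collect text inside highlighted elements together with the tag path."""

    # Elements that never have a closing tag, so they are not tracked.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self, marker="mso-highlight"):
        super().__init__()
        self.marker = marker  # assumed style fragment that marks a highlight
        self.stack = []       # open tags as (tag, is_highlighted) pairs
        self.found = []       # (tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        style = dict(attrs).get("style") or ""
        self.stack.append((tag, self.marker in style))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unbalanced markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and any(highlighted for _, highlighted in self.stack):
            path = "/".join(tag for tag, _ in self.stack)
            self.found.append((path, text))


def find_highlighted(html, marker="mso-highlight"):
    """Return (tag path, text) for every highlighted run of text."""
    finder = HighlightFinder(marker)
    finder.feed(html)
    finder.close()
    return finder.found


sample = (
    "<html><body><p class=MsoNormal>Section "
    "<span style='background:yellow;mso-highlight:yellow'>4.2 Framing</span>"
    " applies.</p><p>Other text</p></body></html>"
)
for path, text in find_highlighted(sample):
    print(path, "->", text)
```

The tag path is the "structure" here; a state machine for step 3 could then look for other text sitting under the same path but without the highlight.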