Re: Parsing MS Word documents

Bill Nalen Wed, 05 Feb 2003 08:19:49 -0800

Okay, I spent a little time last night getting the parser to convert MS Word docs on the fly. Here is a diff for parser.py. Go easy since this is my first crack at Python and diff :-) You'll need to add the following to plucker.ini:

worddoc_converter=c:\projects\wv\bin\wvware "%input%" > "%output%"

The quotes are needed on Windows, where paths can contain spaces. I'm using wvWare for the conversion. So far it seems to Pluck the Word documents from our internal servers just fine.
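
To illustrate how the %input% and %output% placeholders in that setting get filled in, here is a minimal sketch. The function name expand_converter_command and the sample paths are hypothetical and not part of Plucker; only the two placeholders and the wvware command line come from the setting above.

```python
def expand_converter_command(template, input_path, output_path):
    """Fill the %input% and %output% placeholders of a converter template."""
    command = template.replace("%input%", input_path)
    command = command.replace("%output%", output_path)
    return command


# Hypothetical paths containing spaces, to show why the template quotes them.
template = r'c:\projects\wv\bin\wvware "%input%" > "%output%"'
command = expand_converter_command(
    template,
    r"C:\Documents and Settings\bill\wordtemp.doc",
    r"C:\Documents and Settings\bill\wordtemp.html",
)
print(command)
```

Without the quotes in the template, the shell would split each path at its spaces and the converter would receive several broken arguments instead of one file name.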

Bill

10a11
> import os, sys, string
45a47,91
>     elif type[:18] == "application/msword":
>         # retrieve config information
>         pluckerhomedir = config.get_string('PLUCKERHOME')
>         if pluckerhomedir is None:
>             message(0, "Could not find PLUCKERHOME for Word conversion")
>             return None
>         converttemplate = config.get_string('worddoc_converter')
>         if converttemplate is None:
>             message(0, "Could not find Word conversion command")
>             return None
>
>         # need to save data to a local file
>         tempfilenamedoc = os.path.join(pluckerhomedir, "wordtemp.doc")
>         try:
>             file = open (tempfilenamedoc, "wb")
>             file.write (data)
>             file.close ()
>         except IOError, text:
>             message(0, "Error saving temporary file %s" % tempfilenamedoc)
>             return None
>
>         # then run wvware on it > local.html
>         tempfilenamehtml = os.path.join(pluckerhomedir, "wordtemp.html")
>         command = converttemplate
>         command = string.replace (command, '%input%', tempfilenamedoc)
>         command = string.replace (command, '%output%', tempfilenamehtml)
>         try:
>             if os.system (command):
>                 message(0, "Error running Word converter %s" % command)
>                 return None
>         except:
>             message(0, "Exception running word converter %s" % command)
>             return None
>
>         # then load the local.html file to data2
>         try:
>             file = open (tempfilenamehtml, "rb")
>             data2 = file.read ()
>             file.close ()
>         except IOError, text:
>             message(0, "Error reading temporary file %s" % tempfilenamehtml)
>             return None
>         # then create a structuredhtmlparser from data2
>         parser = TextParser.StructuredHTMLParser (url, data2, headers, config, attributes)
>         return parser.get_plucker_doc ()
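
The steps in the patch (save the data to a file, run the converter, read back the HTML) can also be sketched as a standalone function. This is only an illustration under stated assumptions, not the patch itself: it is written for Python 3, it swaps the fixed file names under PLUCKERHOME for a private temporary directory, and it swaps os.system for subprocess.call. The function name convert_with_external_tool is hypothetical, and the usage at the end runs the Python interpreter as a stand-in converter because wvWare may not be installed.

```python
import os
import subprocess
import sys
import tempfile


def convert_with_external_tool(data, template):
    """Run an external converter over `data` (bytes) and return its output.

    Returns None if the converter exits with a non-zero status.
    """
    # A private temporary directory avoids clashes between concurrent runs,
    # which a fixed file name such as wordtemp.doc cannot rule out.
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "wordtemp.doc")
        output_path = os.path.join(workdir, "wordtemp.html")

        with open(input_path, "wb") as handle:
            handle.write(data)

        command = template.replace("%input%", input_path)
        command = command.replace("%output%", output_path)

        # shell=True because the template relies on shell redirection (">").
        status = subprocess.call(command, shell=True)
        if status != 0:
            return None

        with open(output_path, "rb") as handle:
            return handle.read()


# Stand-in converter (an assumption for this demo): the Python interpreter
# upper-cases the input file and writes it to standard output.
stand_in = (
    '"%s" -c "import sys; sys.stdout.write(open(sys.argv[1]).read().upper())"'
    ' "%%input%%" > "%%output%%"' % sys.executable
)
result = convert_with_external_tool(b"hello", stand_in)
```

In the patch the returned HTML is then handed to TextParser.StructuredHTMLParser; that step is left out here because it depends on Plucker's own modules.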
