Okay, I spent a little time last night getting the parser to convert MS Word docs on the fly. Here is a diff for parser.py. Go easy since this is my first crack at Python and diff :-) You'll need to add the following to plucker.ini:
worddoc_converter=c:\projects\wv\bin\wvware "%input%" > "%output%"
The quotes are needed on Windows, where paths can contain spaces. I'm using wvWare for the conversion. So far it seems to Pluck the Word documents from our internal servers just fine.
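To make the placeholder handling concrete, here is a minimal sketch of how the worddoc_converter template gets expanded into a command line. The helper name build_command and the "c:\Program Files\Plucker" directory are made up for illustration; only the template itself comes from the plucker.ini line above.

```python
def build_command(template, input_path, output_path):
    """Fill the %input% and %output% placeholders of a converter template."""
    command = template.replace('%input%', input_path)
    command = command.replace('%output%', output_path)
    return command


# Template as it would appear in plucker.ini.
template = r'c:\projects\wv\bin\wvware "%input%" > "%output%"'

# Hypothetical temp file locations; note the space in the directory name.
command = build_command(template,
                        r'c:\Program Files\Plucker\wordtemp.doc',
                        r'c:\Program Files\Plucker\wordtemp.html')
print(command)
```

Because the quotes live in the template rather than in the code, each substituted path stays a single shell argument even when it contains a space.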
Bill
10a11
> import os, sys, string
45a47,91
> elif type[:18] == "application/msword":
>     # retrieve config information
>     pluckerhomedir = config.get_string('PLUCKERHOME')
>     if pluckerhomedir is None:
>         message(0, "Could not find PLUCKERHOME for Word conversion")
>         return None
>     converttemplate = config.get_string('worddoc_converter')
>     if converttemplate is None:
>         message(0, "Could not find Word conversion command")
>         return None
>
>     # need to save data to a local file
>     tempfilenamedoc = os.path.join(pluckerhomedir, "wordtemp.doc")
>     try:
>         file = open (tempfilenamedoc, "wb")
>         file.write (data)
>         file.close ()
>     except IOError, text:
>         message(0, "Error saving temporary file %s" % tempfilenamedoc)
>         return None
>
>     # then run wvware on it > local.html
>     tempfilenamehtml = os.path.join(pluckerhomedir, "wordtemp.html")
>     command = converttemplate
>     command = string.replace (command, '%input%', tempfilenamedoc)
>     command = string.replace (command, '%output%', tempfilenamehtml)
>     try:
>         if os.system (command):
>             message(0, "Error running Word converter %s" % command)
>             return None
>     except:
>         message(0, "Exception running word converter %s" % command)
>         return None
>
>     # then load the local.html file to data2
>     try:
>         file = open (tempfilenamehtml, "rb")
>         data2 = file.read ()
>         file.close ()
>     except IOError, text:
>         message(0, "Error reading temporary file %s" % tempfilenamehtml)
>         return None
>     # then create a structuredhtmlparser from data2
>     parser = TextParser.StructuredHTMLParser (url, data2, headers, config, attributes)
>     return parser.get_plucker_doc ()
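
The patch writes to fixed wordtemp.doc and wordtemp.html files in PLUCKERHOME and leaves them behind. For anyone adapting it, here is a rough sketch of the same save, convert, read-back sequence using a private temporary directory that is cleaned up afterwards. This is not part of the diff: the function name convert_with_template and the run parameter are invented for illustration, and it assumes a Python recent enough to have the with statement.

```python
import os
import tempfile


def convert_with_template(data, template, run=os.system):
    """Save data as a .doc, run the converter template on it, and return
    the resulting HTML bytes, or None if any step fails."""
    workdir = tempfile.mkdtemp(prefix="plucker-word-")
    doc_path = os.path.join(workdir, "wordtemp.doc")
    html_path = os.path.join(workdir, "wordtemp.html")
    try:
        # Save the downloaded document to a local file.
        with open(doc_path, "wb") as f:
            f.write(data)
        # Expand the placeholders, then run the converter.
        command = template.replace("%input%", doc_path)
        command = command.replace("%output%", html_path)
        if run(command) != 0:
            return None
        # Load the converted HTML back in.
        with open(html_path, "rb") as f:
            return f.read()
    except (IOError, OSError):
        return None
    finally:
        # Remove the temporary files and the directory on every path.
        for path in (doc_path, html_path):
            if os.path.exists(path):
                os.remove(path)
        if os.path.isdir(workdir):
            os.rmdir(workdir)
```

The run argument defaults to os.system, as in the patch, but can be swapped for a stub so the file handling can be exercised without wvWare installed. The returned bytes would then be handed to TextParser.StructuredHTMLParser the same way data2 is in the diff.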
