---Reply to mail from Bill Nalen about Parsing MS Word documents
> Okay, I spent a little time last night getting the parser to convert MS
> Word docs on the fly. Here is a diff for parser.py. Go easy since this
> is my first crack at Python and diff :-) You'll need to add the following
> to plucker.ini:
>
> worddoc_converter=c:\projects\wv\bin\wvware "%input%" > "%output%"
> the quotes are needed for Windows where the paths can have spaces in them.
> I'm using the wvWare conversion stuff. So far it seems to Pluck the Word
> documents from our internal servers just fine.
Bill:
A problem has come up with this approach:
No images - if the document contains images, wvWare converts the images
OK but puts them in the current dir as "basename(output).png". If you
specify a dir ( /usr/local/bin/wvWare -d somewhere %somewhere/name.doc% >
%somewhere/name.html% ) then wvWare overwrites the images and all docs
'share' the same ones (because the name is the same).
I propose we put only the command in plucker.ini/.pluckerrc, like we
do for image converters (worddoc_converter = [path]wvware), and use
something like:

    check = os.path.basename (worddoc_converter)
    (check, ext) = os.path.splitext (check)
    if string.lower (check) == 'wvware':

to check whether it is wvware (so other Word converters could be used) and
build a wvware command line for that:
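
    # tempfile.tempdir can be unset, so ask for the directory explicitly
    tempdir = tempfile.gettempdir()
    # mktemp() returns a full path; keep only the unique file name part
    tempbase = os.path.basename(tempfile.mktemp())
    tempdoc = os.path.join(tempdir, tempbase + ".doc")
    temphtml = os.path.join(tempdir, tempbase + ".html")
    command = worddoc_converter
    command = command + " -d " + tempdir + " -b " + tempbase
    command = command + " " + tempdoc + " > " + temphtml
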
then all converted docs would have their own (correct) images. Cleaning up
the temp files is easily done just before Spider.py exits.
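
For anyone who wants to try the idea outside the parser, here is a rough,
self-contained sketch of the steps above in current Python. It is not the
code I have in parser.py: the function names (is_wvware, new_workdir,
build_wvware_job, cleanup_workdir) and the "plucker-" prefix are made up for
illustration, and it uses one private temp directory per document instead of
tempfile.mktemp(). The only wvWare options it relies on are the -d and -b
shown above.

```python
import os
import shutil
import tempfile


def is_wvware(converter_path):
    """Return True if the configured converter looks like wvWare."""
    # Split on both separators so Windows-style paths also work on Unix.
    name = converter_path.replace("\\", "/").rsplit("/", 1)[-1]
    stem, _ext = os.path.splitext(name)
    return stem.lower() == "wvware"


def new_workdir():
    """Create a private temp directory for one document (hypothetical helper)."""
    return tempfile.mkdtemp(prefix="plucker-")


def build_wvware_job(converter_path, workdir, basename):
    """Return (argv, doc_path, html_path) for one conversion.

    argv carries the -d and -b options; the HTML that wvWare writes to
    stdout is meant to be redirected to html_path by the caller.
    """
    doc_path = os.path.join(workdir, basename + ".doc")
    html_path = os.path.join(workdir, basename + ".html")
    argv = [converter_path, "-d", workdir, "-b", basename, doc_path]
    return argv, doc_path, html_path


def cleanup_workdir(workdir):
    """Remove the temp directory and everything the converter left in it."""
    shutil.rmtree(workdir, ignore_errors=True)
```

The caller would write the fetched document to doc_path, run argv with
stdout redirected to html_path (for example subprocess.run(argv, stdout=out,
check=True) with out opened on html_path), parse the result, and call
cleanup_workdir() on each directory before exiting.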
I've got this approach working and it seems to function well. We can
support other worddoc converters with a simple:

    elif check == 'foo':
        # build commandline, execute and return PluckerTextDocument

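
One possible way to keep that if/elif chain from growing is a small lookup
table keyed on the converter name. This is only a sketch under assumptions:
"foo" is the placeholder converter from above, its options are unknown (so
only the input file is passed), and the names converter_name, ARGV_BUILDERS
and build_argv are invented for the example.

```python
import os


def converter_name(converter_path):
    """Lower-cased program name without directory or extension."""
    # Accept both "/" and "\" so Windows paths from plucker.ini work too.
    name = converter_path.replace("\\", "/").rsplit("/", 1)[-1]
    return os.path.splitext(name)[0].lower()


def wvware_argv(converter_path, workdir, basename, doc_path):
    """wvWare: -d sets the output dir, -b the base name for images."""
    return [converter_path, "-d", workdir, "-b", basename, doc_path]


def foo_argv(converter_path, workdir, basename, doc_path):
    """Placeholder converter; real options unknown, so pass the input only."""
    return [converter_path, doc_path]


# Hypothetical registry: one entry per supported worddoc converter.
ARGV_BUILDERS = {
    "wvware": wvware_argv,
    "foo": foo_argv,
}


def build_argv(converter_path, workdir, basename, doc_path):
    """Pick the builder for the configured converter, or fail loudly."""
    try:
        builder = ARGV_BUILDERS[converter_name(converter_path)]
    except KeyError:
        raise ValueError("unsupported worddoc_converter: %r" % converter_path)
    return builder(converter_path, workdir, basename, doc_path)
```

Adding a converter then means writing one small function and one table
entry, and an unknown worddoc_converter setting produces a clear error
instead of silently doing nothing.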
I can post the files if you want a closer look.
---End reply
Christopher R. Hawks
HAWKSoft
-------------------------------------------------------------------------
They say never to buy a "0" release of software.
Windows 2000 has 3 of 'em.