Start by using the utilities that you have already got:
Don't bother with wp2html for Word 2000; use catdoc.
Use the pdf2html.pl wrapper script with pdftotext and pdfinfo.
Then go to the www.xlHtml.org site and download xlhtml (pptHtml is part of
the download).
Later, when you are happy with the job htdig and the converters are doing:
Upgrade pdftotext and pdfinfo to xpdf v1.0 if you haven't already;
it's well worth the trouble.
Consider purchasing wp2html to give you improved indexing of Word 2000
documents.
Download the swfparser code and install with the swf2html.pl wrapper
script.
Note that swfparser does NOT extract text from Shockwave Flash files, only
links. So you cannot index them, but it may be important on some sites to be
able to follow the links which are embedded in them.
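
As a rough illustration of the starter set above, here is a minimal Python
sketch of a dispatcher that picks a converter by MIME type. It is not part of
htdig or doc2html; the CONVERTERS table, the function names and the MIME type
strings are assumptions made for the example, and the command names assume
the converters are installed on the PATH as catdoc, pdftotext, xlhtml and
ppthtml.

```python
import shutil

# Hypothetical table: MIME type -> converter program (assumed to be on PATH).
CONVERTERS = {
    "application/msword": "catdoc",
    "application/pdf": "pdftotext",
    "application/vnd.ms-excel": "xlhtml",
    "application/vnd.ms-powerpoint": "ppthtml",
}


def build_command(mime_type, path):
    """Return the argument list to convert `path`, or None if unsupported."""
    program = CONVERTERS.get(mime_type)
    if program is None:
        return None
    command = [program, path]
    if program == "pdftotext":
        # pdftotext writes to a file by default; "-" asks for standard output.
        command.append("-")
    return command


def missing_converters():
    """List the converter programs from the table that are not on PATH."""
    programs = set(CONVERTERS.values())
    return sorted(p for p in programs if shutil.which(p) is None)
```

Flash is deliberately left out of the table, since swfparser only yields
links. Calling missing_converters() before a dig is a cheap way to spot a
converter that has not been installed yet.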
--
David Adams
Computing Services
Southampton University
----- Original Message -----
From: "Steve Burton" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, April 11, 2002 9:28 AM
Subject: [htdig] Recommended parser set
> Hi,
>
> I'm just starting using htdig (3.1.6) to index our new company intranet
> and it works (it's brilliant, in fact, but enough crawling)!
>
> At the moment I'm using conv_doc.pl with catdoc, pdftotext and pdfinfo
> as external parsers but I would like to extend the number of document
> types I can handle. I downloaded doc2html and read the docs, and now I'm
> confused (too much choice). Can anyone recommend a parser set that
> works? My priorities are Word 2000, PDF, Excel, PowerPoint and Flash
> (with Flash very low on my list).
>
> Thanks,
>
> Steve.
>
>
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
> subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
>