At 4:12 PM -0400 8/20/01, Steele, David S. wrote:
>htdig uses parse_doc.pl to call pdftotext, and this guy does a lot to the
>output of the pdf parser. I don't see why these modifications are needed.
The parse_doc script is an external parser rather than an external
converter, and as such it must mark up its output to match the format
expected by the ExternalParser code:
<http://www.htdig.org/attrs.html#external_parsers>
More useful for your purposes would be the conv_doc.pl script, which
is an external converter.
>- If pdftohtml produces a single clean html file from PDF input, can I make
>this work by just identifying it as the external_handler for pdf (with
>appropriate arguments)?
Not quite, but close.
>- Does something else have to be done to tell htdig to process the file as
>html, with links?
See the description of the external converter features at the
documentation URL above.
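
As a rough sketch only: an external converter is declared in htdig.conf
by giving both the input and the output MIME type in the
external_parsers attribute. The install path below is an assumption, so
adjust it for your system, and check the documentation above for the
exact arguments the script receives.

```
# htdig.conf fragment (sketch; the script path is a hypothetical example)
# "application/pdf->text/html" declares a converter: its output is handed
# back to htdig and parsed as HTML, so links in it can be followed.
external_parsers: application/pdf->text/html /usr/local/bin/conv_doc.pl
```

With plain "application/pdf" and no "->text/html", the program would be
treated as an external parser instead, and would have to emit the
marked-up record format that parse_doc.pl produces.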
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html