[htdig] Processing pdf Files

Douglas Kline Tue, 20 Apr 2004 17:42:56 -0700

Several ways of including the text content of PDF files have been mentioned
on the list and in the documentation.  What are the pros and cons of each?


Does xpdf version 3.00 have any advantage over version 1.00?

doc2html.pl, pdf2html.pl, and pdftotext are all mentioned.  Is there an
advantage to one over another?  How about in comparison to other means?  Is
there some advantage to having more than one of these?

The http://www.htdig.org/contrib/ page on the Web site lists more possibilities
like acroconv.pl, conv_doc.pl, and parsepdf.pl.  What are their pros and cons?

The description of conv_doc.pl makes a distinction between parsing and
converting with the statement, "External converters have two advantages over
external parsers.  They are easier to write, and the parsing is done in a more
consistent way for all document types."  I'm not sure I understand this.  Does
the external parser do more than the external converter by doing some of what
htdig would do in searching for strings?  Would there be some efficiency
advantage to an external parser?  If an external converter parses "in a more
consistent way for all document types", then how is it different from an
external parser and what kind of inconsistencies might arise?  Wouldn't strings
be unambiguously identified in a pdf file by any of these tools?
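
To make the question concrete, here is how I currently picture the two
approaches.  This is only a sketch in Python, not anything taken from the
htdig source.  The function names are hypothetical, and the tab-separated
"w word location heading" record layout is my reading of the external_parsers
documentation and may not be exact.  The only external call is
"pdftotext file.pdf -", which writes the extracted text to standard output.

```python
import subprocess


def pdftotext_command(pdf_path):
    # "-" as the output file asks pdftotext to write to standard output.
    return ["pdftotext", pdf_path, "-"]


def convert(pdf_path):
    """Converter style: return plain text and leave word parsing to htdig."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def word_records(text):
    """Parser style: split the text ourselves and emit one record per word.

    Assumed record layout: w <TAB> word <TAB> location (0-999) <TAB> heading.
    The whitespace split below is naive; it does not apply htdig's own word
    rules, which may be the kind of inconsistency the description means.
    """
    words = text.split()
    total = max(len(words), 1)
    records = []
    for index, word in enumerate(words):
        # Relative position of the word in the document, scaled to 0-999.
        location = index * 1000 // total
        records.append(f"w\t{word.lower()}\t{location}\t0")
    return records
```

If that picture is right, the converter only has to produce text, while the
parser has to reproduce htdig's word splitting on its own.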

TIA.

Douglas

========
Douglas Kline
[EMAIL PROTECTED]




