Re: xpdf 0.90 announcement (was Re: [htdig] parse_doc.pl slow)

Gilles Detillieux Thu, 12 Aug 1999 11:26:35 -0700


According to Frank Guangxin Liu:
> Here is how I tested it:
> pdftotext.old -rawdump test.pdf
> grep F_Table test.txt
> can't find any match. (F_Table is a word in the landscape table
>                        on Page 54 of 72).
> 
> pdftotext.new -raw test.pdf
> grep F_Table test.txt
> found the match!!
> 
> I understand the "test.txt" generated from the new pdftotext
> still looks funny (unformated) for those landscape tables
> (Page 48 and beyond), but at least it has all the words in
> there which is all htdig cares.

But not all the words are intact.  Here's an example of pdftotext output
from the PDF you gave me:

  Co
mpliance wit
h QS
P 1-
02, Pro
tection of Pro
prietary Interests,
 is re
quired. Info
rmation contained with
in this d
ocument or generated as a result thereof is no
t to be disclosed to third partie
s

Most of the words are intact, but a lot of them wrap onto another line,
so htdig treats the two parts as separate words.  Yes, it's a lot better
than what you'd get with pdftotext 0.80, with my rawdump patch, but is it
as good as what you'd get from htdig's parsing of acroread's PostScript
output?
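
To make the effect concrete, here is a minimal sketch in Python of what happens when an indexer splits that output into words. It is not htdig's actual parser: the index_words() helper and its regular expression are assumptions standing in for whatever word splitting the indexer does, and the sample string is just the first few lines of the output quoted above.

```python
import re

# First few lines of the pdftotext output quoted above, line breaks preserved.
raw = (
    "  Co\n"
    "mpliance wit\n"
    "h QS\n"
    "P 1-\n"
    "02, Pro\n"
    "tection of Pro\n"
    "prietary Interests,\n"
)

def index_words(text):
    # Hypothetical stand-in for an indexer's tokenizer: every run of
    # letters, digits, or underscores counts as one word, so a line
    # break in the middle of a word produces two separate entries.
    return re.findall(r"[A-Za-z0-9_]+", text)

words = index_words(raw)
print(words)

# A word that was broken across lines is not findable as a whole...
print("Compliance" in words)
# ...while a word that happened to stay on one line still is.
print("Interests" in words)
```

A search for "Compliance" misses under this model, because the index only holds "Co" and "mpliance"; "Interests" is found because it was not broken. A word like F_Table matches only if it happens to fall on a single line.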

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

