According to Neil Brewer:
> I'm using htdig with pdftotext and parse_doc.pl to index all of my pdf
> files. However, with one particular pdf file, I get the following error:
>
> [root pdf]# pdftotext ts.pdf
> Error: Unknown Type 0 character set: Adobe-Identity
> Error: Unknown Type 0 character set: Adobe-Identity
> Error: Unknown Type 0 character set: Adobe-Identity
>
> And the generated ts.txt looks as follows:
>
> -rw-rw-r-- 1 root httpd 3426 Jan 15 16:27 ts.txt
>
> But, if I 'cat ts.txt' it shows no data. Furthermore, pdftohtml does extract
> the images, but not the text, and I get the same error. pdftops produces the
> same error 6 times. I can copy and paste the text if I open it in Acroread
> for Windows. So... any ideas? I can't find information on this anywhere.

Well, it looks like a problem with that particular PDF file, or the xpdf
package, or both. Either way, it's not an htdig problem and there isn't
much we can do about it. Make sure you have the latest version of xpdf
(from http://www.foolabs.com/xpdf/), and if the problem persists and you
know the PDF file is correct, then you may want to bring up the matter
with xpdf's author.

On an unrelated note, you may want to consider abandoning parse_doc.pl in
favour of conv_doc.pl or doc2html.pl. It won't solve this problem, but it
may help with other problems you're likely to run into. See
http://www.htdig.org/FAQ.html#q4.9

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
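
If more than one PDF in the collection might be affected, it could help to find them before indexing. The following is only a rough sketch in Python, not part of htdig or xpdf. It assumes pdftotext is on the PATH and accepts "-" as the output file to write to stdout, and it assumes that a non-empty output which shows nothing under 'cat' consists only of whitespace or control characters such as form feeds. The function names has_visible_text and extract_text are made up for illustration.

```python
import subprocess


def has_visible_text(text: str) -> bool:
    """Return True if text has at least one printable, non-whitespace character."""
    return any(ch.isprintable() and not ch.isspace() for ch in text)


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return whatever it writes to stdout.

    Assumption: "-" as the output file name sends the text to stdout.
    The bytes are decoded as Latin-1 so that decoding never fails; this is
    good enough for checking whether anything visible came out at all.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=False,  # pdftotext may report errors but still produce output
    )
    return result.stdout.decode("latin-1")
```

Calling has_visible_text(extract_text("ts.pdf")) should then return False for a file like the one described above, which would mark it as one to check by hand or to report to xpdf's author.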

