According to Neil Brewer:
> I'm using htdig with pdftotext and parse_doc.pl to index all of my PDF
> files. However, with one particular PDF file, I get the following error:
> 
> [root pdf]# pdftotext ts.pdf
> Error: Unknown Type 0 character set: Adobe-Identity
> Error: Unknown Type 0 character set: Adobe-Identity
> Error: Unknown Type 0 character set: Adobe-Identity
> 
> And the generated ts.txt looks as follows:
> 
> -rw-rw-r--   1 root     httpd        3426 Jan 15 16:27 ts.txt
> 
> But if I 'cat ts.txt' it shows no data. Furthermore, pdftohtml does extract
> the images, but not the text, and I get the same error. pdftops produces the
> same error 6 times. I can copy and paste the text if I open it in Acroread
> for Windows. So... any ideas? I can't find information on this anywhere.

Well, it looks like a problem with that particular PDF file, or the xpdf
package, or both.  Either way, it's not an htdig problem and there isn't
much we can do about it.  Make sure you have the latest version of xpdf
(from http://www.foolabs.com/xpdf/), and if the problem persists and you
know the PDF file is correct, then you may want to bring up the matter
with xpdf's author.
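
Before reporting it, it may help to confirm what is actually in that
3426-byte ts.txt, since a file that shows nothing under 'cat' could be
all spaces, newlines and form feeds. Here is a rough Python sketch that
counts the visible characters in a file. The function names are made up
for illustration and are not part of htdig or xpdf, and decoding as
Latin-1 is only an assumption so that every byte can be inspected.

```python
import sys


def count_visible(data: bytes) -> int:
    """Count characters that are printable and not whitespace."""
    # Latin-1 maps every byte to a character, so decoding never fails.
    text = data.decode("latin-1")
    return sum(1 for ch in text if ch.isprintable() and not ch.isspace())


def has_visible_text(data: bytes) -> bool:
    """Return True if the data holds at least one visible character."""
    return count_visible(data) > 0


# Only runs when a file name is given, e.g. the ts.txt from pdftotext.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        contents = handle.read()
    print(len(contents), "bytes,", count_visible(contents), "visible characters")
```

If this reports zero visible characters, the converter wrote only
whitespace and control characters, which would fit the text extraction
failing in xpdf rather than anything going wrong later in htdig.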

On an unrelated note, you may want to consider abandoning parse_doc.pl
in favour of conv_doc.pl or doc2html.pl.  It won't solve this problem,
but it may help with other problems you're likely to run into.
See http://www.htdig.org/FAQ.html#q4.9

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
