You are the man. Sorry about sending two emails; the first time I sent it to just you, but I have found so many postings with this same problem that I wanted to let everyone in on this. There was a space, and I would never have guessed. It works like a charm now.
Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Gilles Detillieux
Sent: Tuesday, April 09, 2002 4:00 PM
To: Christian Fredrickson
Cc: Rzepa Henry; [EMAIL PROTECTED]
Subject: Re: [htdig] PDF problems

According to Christian Fredrickson:
> To remind you all, this works fine on .DOC files, not on PDF.

Got it the first time. I can't always reply immediately.

> Running
> pdf2html from the command line outputs text perfectly. I ran htdig
> with -vvvv -s and here is the output while the start URL is set to one PDF
> file.
>
> Here are my external parser lines in the .conf file:
>
>
> external_parsers: application/msword->text/html
> /home/ch/staff/fredrick/htdig/contrib/doc2html/doc2html.pl \
> application/pdf->text/html
> /home/ch/staff/fredrick/htdig/contrib/doc2html/doc2html.pl \
> application/postscript->text/html
> /home/ch/staff/fredrick/htdig/contrib/doc2html/doc2html.pl

I'm assuming these lines got mangled by your mail program, so they don't tell me what they REALLY look like in your config file. The absolutely critical point is that there MUST not be ANY space characters between the backslash at the end of the line and the newline character. Based on the htdig output you sent, it's obvious that htdig is not even calling doc2html.pl for application/pdf files, but is instead falling back on the pdf_parser. I'll wager there's a space after the backslash on the first line of external_parsers in your config file. See http://www.htdig.org/FAQ.html#q5.31

> Here is the output from htdig:
...
> 0:0:0:http://www.mydomain.com/minutes/facultyMeetingMinutes11_09_2000.pdf:
> Retrieval command for
> http://www.mydomain.com/minutes/facultyMeetingMinutes11_09_2000.pdf: GET
> /minutes/facultyMeetingMinutes11_09_2000.pdf HTTP/1.0
> User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
> Authorization: Basic d2ViOmNoZXdlYjY2Ng==
> Host: www.mydomain.com
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Mon, 08 Apr 2002 15:40:06 GMT
> Header line: Server: Apache/1.3.9 (Unix) PHP/3.0.14 mod_perl/1.21
> mod_ssl/2.4.10 OpenSSL/0.9.4
> Header line: Last-Modified: Tue, 02 Apr 2002 19:22:36 GMT
> Converted Tue, 02 Apr 2002 19:22:36 GMT to Tue, 02 Apr 2002 19:22:36
> Header line: ETag: "2d53e-22ea-3caa04fc"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 8938
> Header line: Connection: close
> Header line: Content-Type: application/pdf

OK, that's the correct content-type for PDFs...

> Header line:
> returnStatus = 0
> Read 8192 from document
> Read 746 from document
> Read a total of 8938 bytes
> PDF::setContents(8938 bytes)
> PDF::parse(http://www.mydomain.com/minutes/facultyMeetingMinutes11_09_2000.p
> df)
> PDF::parseNonTextLine: title is "CHEMICAL AND FUELS ENGINEERING"

However, these PDF:: messages are coming from the PDF class, which is only used when not overridden by an external_parsers entry for application/pdf, or if the Content-Type header has garbage characters after the application/pdf string.

--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U.
of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
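
For anyone else chasing the same problem: the fault described in this thread (a space or tab between the trailing backslash and the newline in an external_parsers entry) is invisible in most editors, so a small check can help. The following is only a rough sketch and not part of ht://Dig. The function name find_broken_continuations and the sample text are made up for illustration, and it simply flags any line whose last backslash is followed by trailing whitespace.

```python
import re

# A backslash followed by one or more spaces or tabs at the very end of a line.
# Such a line looks like a continuation but, per the advice in this thread,
# is not treated as one.
TRAILING_WS_AFTER_BACKSLASH = re.compile(r"\\[ \t]+$")


def find_broken_continuations(text):
    """Return the 1-based numbers of lines ending in backslash + whitespace."""
    bad_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        if TRAILING_WS_AFTER_BACKSLASH.search(line):
            bad_lines.append(number)
    return bad_lines


# Hypothetical sample: the first line has a stray space after the backslash,
# the second line is a correct continuation.
sample = (
    "external_parsers: application/msword->text/html /path/to/doc2html.pl \\ \n"
    "    application/pdf->text/html /path/to/doc2html.pl \\\n"
    "    application/postscript->text/html /path/to/doc2html.pl\n"
)

for number in find_broken_continuations(sample):
    print("line %d: whitespace after trailing backslash" % number)
```

To check a real file, read its contents and pass them to the function, for example find_broken_continuations(open(path).read()), where path points at your own .conf file.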

