You are the man. Sorry about sending two emails. The first time I sent it to
just you, but I have found so many postings with this same problem that I
wanted to let everyone in on this. There was a space, and I would never have
guessed. It works like a charm now.

Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Gilles Detillieux
Sent: Tuesday, April 09, 2002 4:00 PM
To: Christian Fredrickson
Cc: Rzepa Henry; [EMAIL PROTECTED]
Subject: Re: [htdig] PDF problems


According to Christian Fredrickson:
> To remind you all, this works fine on .DOC files, not on PDF.

Got it the first time.  I can't always reply immediately.

> Running
> pdf2html from the command line outputs text perfectly. I ran htdig
> with -vvvv -s and here is the output while the start URL is set to one PDF
> file.
>
> Here are my external parser lines in the .conf file:
>
>
> external_parsers: application/msword->text/html
> /home/ch/staff/fredrick/htdig/contrib/doc2html/doc2html.pl \
>                      application/pdf->text/html
> /home/ch/staff/fredrick/htdig/contrib/doc2html/doc2html.pl \
>                      application/postscript->text/html
> /home/ch/staff/fredrick/htdig/contrib/doc2html/doc2html.pl

I'm assuming these lines got mangled by your mail program, so they don't
tell me what they REALLY look like in your config file.  The absolutely
critical point is that there MUST not be ANY space characters between
the backslash at the end of the line and the newline character.  Based on
the htdig output you sent, it's obvious that htdig is not even calling
doc2html.pl for application/pdf files, but is instead falling back on the
pdf_parser.  I'll wager there's a space after the backslash on the first
line of external_parsers in your config file.

See http://www.htdig.org/FAQ.html#q5.31
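
Since a trailing space after a backslash is invisible in most editors, a
small script can find it for you. The following is only a minimal sketch in
Python, not part of ht://Dig: the function name find_bad_continuations and
the /path/to/ paths are made up for illustration, and treating tabs the same
as spaces is an assumption on my part.

```python
import re

# A backslash followed by one or more spaces or tabs at the very end of a line.
# Counting tabs as well as spaces is an assumption, not documented behaviour.
_BAD_CONTINUATION = re.compile(r"\\[ \t]+$")


def find_bad_continuations(config_text):
    """Return 1-based line numbers where whitespace follows a trailing backslash."""
    bad = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if _BAD_CONTINUATION.search(line):
            bad.append(number)
    return bad


if __name__ == "__main__":
    # Hypothetical config text: the first line has a space after the backslash.
    sample = (
        "external_parsers: application/msword->text/html /path/to/doc2html.pl \\ \n"
        "                  application/pdf->text/html /path/to/doc2html.pl \\\n"
        "                  application/postscript->text/html /path/to/doc2html.pl\n"
    )
    print(find_bad_continuations(sample))  # prints [1]
```

To check a real file, read it with open(...).read() and pass the text to
find_bad_continuations; any line number it reports is a continuation line
that needs its trailing whitespace removed.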

> Here is the output from htdig:
...
> 0:0:0:http://www.mydomain.com/minutes/facultyMeetingMinutes11_09_2000.pdf:
> Retrieval command for
> http://www.mydomain.com/minutes/facultyMeetingMinutes11_09_2000.pdf: GET
> /minutes/facultyMeetingMinutes11_09_2000.pdf HTTP/1.0
> User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
> Authorization: Basic d2ViOmNoZXdlYjY2Ng==
> Host: www.mydomain.com
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Mon, 08 Apr 2002 15:40:06 GMT
> Header line: Server: Apache/1.3.9 (Unix) PHP/3.0.14 mod_perl/1.21
> mod_ssl/2.4.10 OpenSSL/0.9.4
> Header line: Last-Modified: Tue, 02 Apr 2002 19:22:36 GMT
> Converted Tue, 02 Apr 2002 19:22:36 GMT to Tue, 02 Apr 2002 19:22:36
> Header line: ETag: "2d53e-22ea-3caa04fc"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 8938
> Header line: Connection: close
> Header line: Content-Type: application/pdf

OK, that's the correct content-type for PDFs...

> Header line:
> returnStatus = 0
> Read 8192 from document
> Read 746 from document
> Read a total of 8938 bytes
> PDF::setContents(8938 bytes)
>
> PDF::parse(http://www.mydomain.com/minutes/facultyMeetingMinutes11_09_2000.pdf)
> PDF::parseNonTextLine: title is "CHEMICAL AND FUELS ENGINEERING"

However, these PDF:: messages are coming from the PDF class, which is
only used when no external_parsers entry overrides it for
application/pdf, or when the Content-Type header has garbage characters
after the application/pdf string.
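
To rule out the second possibility, you can look at exactly what follows
the type in the header value. This is a rough Python sketch only; it does
not reproduce htdig's own matching logic, and the function name
content_type_remainder is made up for illustration.

```python
def content_type_remainder(header_value, expected="application/pdf"):
    """Return whatever follows `expected` in the header value.

    Returns None if the value does not start with `expected` at all, and an
    empty string if the value is exactly `expected`.
    """
    if not header_value.lower().startswith(expected):
        return None
    return header_value[len(expected):]


if __name__ == "__main__":
    # repr() makes invisible characters such as a stray carriage return visible.
    print(repr(content_type_remainder("application/pdf")))    # prints ''
    print(repr(content_type_remainder("application/pdf\r")))  # prints '\r'
```

An empty remainder means the header value is clean, which points back to
the external_parsers lines as the cause.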

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to
<[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

