Hello...and help!
I am running htdig-3.1.5 on Apache 1.3. I am trying to parse and index PDF
files and have had no success to date. I am using the pdftotext and pdfinfo
utilities from xpdf-0.92 with pdf2html.pl. When I execute the pdf2html.pl
script from the command line, I get HTML output. However, when the script
should be called through rundig, htdig appears to ignore the
external_parsers specification (external_parsers:
"application/pdf->text/html"/local/apache/cgi-bin/pdf2html.pl). I also tried
changing the type in the external_parsers line to "application/pdf;
charset=iso-8859-1->text/html"... but then htdig went looking for
acroread. Either way, I get the "Deleted, no excerpt" message and none of
the PDF files get indexed. The files DO contain text, and max_doc_size is
set to a value larger than the largest PDF file.
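
For anyone trying to reproduce the command-line test, here is a minimal
sketch that runs a converter with four arguments instead of just a file
name. It assumes htdig passes an external parser the input file, the content
type, the URL, and the configuration file, in that order; that is my reading
of the external_parsers documentation and should be checked against it. The
function names run_converter and looks_like_html are hypothetical.

```python
import subprocess


def run_converter(converter_cmd, pdf_path, url, config_path,
                  content_type="application/pdf"):
    """Run a converter with the four arguments htdig is assumed to pass.

    converter_cmd is a list such as ["/local/apache/cgi-bin/pdf2html.pl"].
    Returns (exit status, stdout, stderr).
    """
    # Assumed argument order: input file, content type, URL, config file.
    argv = list(converter_cmd) + [pdf_path, content_type, url, config_path]
    result = subprocess.run(argv, capture_output=True, text=True,
                            errors="replace")
    return result.returncode, result.stdout, result.stderr


def looks_like_html(output):
    """Rough check that the converter wrote HTML rather than nothing."""
    return "<html" in output.lower()
```

A call might look like run_converter(["/local/apache/cgi-bin/pdf2html.pl"],
"hello.pdf", "http://depression.ori.org/manual/pdftest/hello.pdf",
"htdig.conf"), where the local PDF path and the configuration file path are
placeholders. An empty stdout or a non-zero exit status here would point at
the script rather than at htdig.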
Is it possible that there is a setting at the server level that needs
adjusting? I have tested everything I can think of relating to the htdig
configuration file (and have read through every e-mail in the archives that
I could find).
Below is the output I received from ./rundig -vvv.
Any help would be greatly appreciated!
Jason
1:0:http://depression.ori.org/
New server: depression.ori.org, 80
Retrieval command for http://depression.ori.org/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: depression.ori.org
Header line: HTTP/1.1 404 Not Found
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Connection: close
Header line: Content-Type: text/html
Header line:
returnStatus = 1
pushed
pick: depression.ori.org, # servers = 1
0:0:0:http://depression.ori.org/: Retrieval command for http://depression.ori.org/:
GET / HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Wed, 08 Aug 2001 23:19:49 GMT
Translated Wed, 08 Aug 2001 23:19:49 GMT to 2001-08-08 23:19:49 (101)
And converted to Wed, 08 Aug 2001 23:19:49
Header line: ETag: "1ee96-6fc-3b71c915"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 1788
Header line: Connection: close
Header line: Content-Type: text/html
Header line:
returnStatus = 0
Read 1788 from document
Read a total of 1788 bytes
title: Test Page for Apache Installation on Web Site
A tag: pos = 2, position = ="http://www.apache.org/">
href: http://www.apache.org/ (Apache Web server)
Rejected: URL not in the limits!
url rejected: (level 1)http://www.apache.org/
A tag: pos = 2, position = ="http://depression.ori.org/manual/pdftest/pdf.html">
href: http://depression.ori.org/manual/pdftest/pdf.html (temp)
resolving 'http://depression.ori.org/manual/pdftest/pdf.html'
pushing http://depression.ori.org/manual/pdftest/pdf.html
+A tag: pos = 2, position = ="http://depression.ori.org/manual/pdftest/0019.html">
href: http://depression.ori.org/manual/pdftest/0019.html (temp)
resolving 'http://depression.ori.org/manual/pdftest/0019.html'
pushing http://depression.ori.org/manual/pdftest/0019.html
+A tag: pos = 5, position = ="manual/index.html"
>
href: http://depression.ori.org/manual/index.html (documentation)
Rejected: URL not in the limits!
url rejected: (level 1)http://depression.ori.org/manual/index.html
image: http://depression.ori.org/apache_pb.gif
size = 1788
pick: depression.ori.org, # servers = 1
1:1:1:http://depression.ori.org/manual/pdftest/pdf.html: Retrieval command for
http://depression.ori.org/manual/pdftest/pdf.html: GET /manual/pdftest/pdf.html
HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://depression.ori.org/
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Thu, 09 Aug 2001 18:30:24 GMT
Translated Thu, 09 Aug 2001 18:30:24 GMT to 2001-08-09 18:30:24 (101)
And converted to Thu, 09 Aug 2001 18:30:24
Header line: ETag: "b414d-1ef-3b72d6c0"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 495
Header line: Connection: close
Header line: Content-Type: text/html
Header line:
returnStatus = 0
Read 495 from document
Read a total of 495 bytes
title: pdf test
A tag: pos = 2, position = ="Document1.pdf">
href: http://depression.ori.org/manual/pdftest/Document1.pdf (document 1)
resolving 'http://depression.ori.org/manual/pdftest/Document1.pdf'
pushing http://depression.ori.org/manual/pdftest/Document1.pdf
+A tag: pos = 2, position = ="0096a.pdf">
href: http://depression.ori.org/manual/pdftest/0096a.pdf (0096)
resolving 'http://depression.ori.org/manual/pdftest/0096a.pdf'
pushing http://depression.ori.org/manual/pdftest/0096a.pdf
+A tag: pos = 2, position = ="hello.pdf">
href: http://depression.ori.org/manual/pdftest/hello.pdf (hello)
resolving 'http://depression.ori.org/manual/pdftest/hello.pdf'
pushing http://depression.ori.org/manual/pdftest/hello.pdf
+A tag: pos = 2, position = ="asudoc.pdf">
href: http://depression.ori.org/manual/pdftest/asudoc.pdf (asudoc)
resolving 'http://depression.ori.org/manual/pdftest/asudoc.pdf'
pushing http://depression.ori.org/manual/pdftest/asudoc.pdf
+A tag: pos = 2, position = ="asudoc2.pdf">
href: http://depression.ori.org/manual/pdftest/asudoc2.pdf (asudoc2)
resolving 'http://depression.ori.org/manual/pdftest/asudoc2.pdf'
pushing http://depression.ori.org/manual/pdftest/asudoc2.pdf
+ size = 495
pick: depression.ori.org, # servers = 1
2:2:1:http://depression.ori.org/manual/pdftest/0019.html: Retrieval command for
http://depression.ori.org/manual/pdftest/0019.html: GET /manual/pdftest/0019.html
HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://depression.ori.org/
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Wed, 08 Aug 2001 22:47:17 GMT
Translated Wed, 08 Aug 2001 22:47:17 GMT to 2001-08-08 22:47:17 (101)
And converted to Wed, 08 Aug 2001 22:47:17
Header line: ETag: "b414f-123-3b71c175"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 291
Header line: Connection: close
Header line: Content-Type: text/html
Header line:
returnStatus = 0
Read 291 from document
Read a total of 291 bytes
title: SUD T4 Living Together Questionnaire (0019) Proband
META Description: 0019
A tag: pos = 2, position = ="0019.pdf">
href: http://depression.ori.org/manual/pdftest/0019.pdf (0019.pdf)
resolving 'http://depression.ori.org/manual/pdftest/0019.pdf'
pushing http://depression.ori.org/manual/pdftest/0019.pdf
+ size = 291
pick: depression.ori.org, # servers = 1
3:3:2:http://depression.ori.org/manual/pdftest/Document1.pdf: Retrieval command for
http://depression.ori.org/manual/pdftest/Document1.pdf: GET
/manual/pdftest/Document1.pdf HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://depression.ori.org/manual/pdftest/pdf.html
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Fri, 03 Aug 2001 23:17:26 GMT
Translated Fri, 03 Aug 2001 23:17:26 GMT to 2001-08-03 23:17:26 (101)
And converted to Fri, 03 Aug 2001 23:17:26
Header line: ETag: "b414b-941-3b6b3106"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 2369
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 2369 from document
Read a total of 2369 bytes
size = 2369
pick: depression.ori.org, # servers = 1
4:4:2:http://depression.ori.org/manual/pdftest/0096a.pdf: Retrieval command for
http://depression.ori.org/manual/pdftest/0096a.pdf: GET /manual/pdftest/0096a.pdf
HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://depression.ori.org/manual/pdftest/pdf.html
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Fri, 03 Aug 2001 22:24:42 GMT
Translated Fri, 03 Aug 2001 22:24:42 GMT to 2001-08-03 22:24:42 (101)
And converted to Fri, 03 Aug 2001 22:24:42
Header line: ETag: "b414c-84642-3b6b24aa"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 542274
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line: X-Pad: avoid browser bug
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 1602 from document
Read a total of 542274 bytes
size = 542274
pick: depression.ori.org, # servers = 1
5:5:2:http://depression.ori.org/manual/pdftest/hello.pdf: Retrieval command for
http://depression.ori.org/manual/pdftest/hello.pdf: GET /manual/pdftest/hello.pdf
HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://depression.ori.org/manual/pdftest/pdf.html
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Wed, 08 Aug 2001 22:23:17 GMT
Translated Wed, 08 Aug 2001 22:23:17 GMT to 2001-08-08 22:23:17 (101)
And converted to Wed, 08 Aug 2001 22:23:17
Header line: ETag: "b414e-395-3b71bbd5"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 917
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 917 from document
Read a total of 917 bytes
size = 917
pick: depression.ori.org, # servers = 1
6:6:2:http://depression.ori.org/manual/pdftest/asudoc.pdf: Retrieval command for
http://depression.ori.org/manual/pdftest/asudoc.pdf: GET /manual/pdftest/asudoc.pdf
HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://depression.ori.org/manual/pdftest/pdf.html
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Thu, 09 Aug 2001 01:40:46 GMT
Translated Thu, 09 Aug 2001 01:40:46 GMT to 2001-08-09 01:40:46 (101)
And converted to Thu, 09 Aug 2001 01:40:46
Header line: ETag: "b4157-6256-3b71ea1e"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 25174
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 598 from document
Read a total of 25174 bytes
size = 25174
pick: depression.ori.org, # servers = 1
7:7:2:http://depression.ori.org/manual/pdftest/asudoc2.pdf: Retrieval command for
http://depression.ori.org/manual/pdftest/asudoc2.pdf: GET /manual/pdftest/asudoc2.pdf
HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://depression.ori.org/manual/pdftest/pdf.html
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Thu, 09 Aug 2001 01:40:49 GMT
Translated Thu, 09 Aug 2001 01:40:49 GMT to 2001-08-09 01:40:49 (101)
And converted to Thu, 09 Aug 2001 01:40:49
Header line: ETag: "b4158-b800-3b71ea21"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 47104
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 6144 from document
Read a total of 47104 bytes
size = 47104
pick: depression.ori.org, # servers = 1
8:8:2:http://depression.ori.org/manual/pdftest/0019.pdf: Retrieval command for
http://depression.ori.org/manual/pdftest/0019.pdf: GET /manual/pdftest/0019.pdf
HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://depression.ori.org/manual/pdftest/0019.html
Host: depression.ori.org
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 09 Aug 2001 18:32:15 GMT
Header line: Server: Apache/1.3.6 (Unix)
Header line: Last-Modified: Wed, 08 Aug 2001 22:45:52 GMT
Translated Wed, 08 Aug 2001 22:45:52 GMT to 2001-08-08 22:45:52 (101)
And converted to Wed, 08 Aug 2001 22:45:52
Header line: ETag: "b4150-84b32-3b71c120"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 543538
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line: X-Pad: avoid browser bug
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 2866 from document
Read a total of 543538 bytes
size = 543538
pick: depression.ori.org, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig: depression.ori.org:80 9 documents
htmerge: Sorting...
htmerge: Merging...
htmerge: Total word count: 81
0/http://depression.ori.org/
2/http://depression.ori.org/manual/pdftest/0019.html
Deleted, no excerpt: 8/http://depression.ori.org/manual/pdftest/0019.pdf
Deleted, no excerpt: 4/http://depression.ori.org/manual/pdftest/0096a.pdf
Deleted, no excerpt: 3/http://depression.ori.org/manual/pdftest/Document1.pdf
Deleted, no excerpt: 6/http://depression.ori.org/manual/pdftest/asudoc.pdf
Deleted, no excerpt: 7/http://depression.ori.org/manual/pdftest/asudoc2.pdf
Deleted, no excerpt: 5/http://depression.ori.org/manual/pdftest/hello.pdf
1/http://depression.ori.org/manual/pdftest/pdf.html
htmerge: Total documents: 3
htmerge: Total doc db size (in K): 2
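
To make the log above easier to scan, here is a minimal sketch that pairs
each "Deleted, no excerpt" URL with the Content-Type the server reported for
it. The line patterns are taken from this particular log and are assumed,
not guaranteed, to hold for other htdig versions or verbosity levels; the
function name deleted_with_types is hypothetical.

```python
import re

# Patterns modeled on the rundig -vvv output shown above (an assumption).
RETRIEVAL = re.compile(r"^\d+:\d+:\d+:(\S+): Retrieval command")
CONTENT_TYPE = re.compile(r"^Header line: Content-Type:\s*(.+?)\s*$")
DELETED = re.compile(r"^Deleted, no excerpt: \d+/(\S+)")


def deleted_with_types(log_text):
    """Map each deleted URL to the Content-Type seen when it was fetched."""
    types = {}          # URL -> Content-Type header value
    current_url = None  # URL of the document currently being retrieved
    deleted = []
    for line in log_text.splitlines():
        match = RETRIEVAL.match(line)
        if match:
            current_url = match.group(1)
            continue
        match = CONTENT_TYPE.match(line)
        if match and current_url is not None:
            types[current_url] = match.group(1)
            continue
        match = DELETED.match(line)
        if match:
            deleted.append(match.group(1))
    return {url: types.get(url, "unknown") for url in deleted}
```

If every deleted URL maps to application/pdf, as appears to be the case in
this run, that suggests the PDFs were fetched in full but no text came back
from the parsing step.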