Your external_parsers: statement looks OK, but it
is being ignored by htdig. Make sure that the '\' is the last character on
each line. Do not have a '\' on the last line. Start every line but
the first with at least one space.
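
As a rough sketch of that layout, using the MIME types and the doc2html.pl path from your own config quoted below (I have not tested this on your system, and it assumes doc2html.pl really is installed at that path), the attribute might look like this:

```
external_parsers: application/rtf->text/html /var/www/html/doc2html/doc2html.pl \
 text/rtf->text/html /var/www/html/doc2html/doc2html.pl \
 application/pdf->text/html /var/www/html/doc2html/doc2html.pl \
 application/postscript->text/html /var/www/html/doc2html/doc2html.pl \
 application/msword->text/html /var/www/html/doc2html/doc2html.pl \
 application/msexcel->text/html /var/www/html/doc2html/doc2html.pl \
 application/vnd.ms-excel->text/html /var/www/html/doc2html/doc2html.pl \
 application/vnd.ms-powerpoint->text/html /var/www/html/doc2html/doc2html.pl
```

Note that the last line has no '\'. A space or tab after a '\' is invisible in most editors, so it is worth checking the line endings carefully.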
David Adams
Southampton University
----- Original Message -----
Sent: Thursday, April 15, 2004 11:28 AM
Subject: [htdig] PDF Contents not being parsed

Hi, I am trying to parse PDF documents, but htdig doesn't parse the
contents. I am only getting the file name as a result of the search.

doc2html parses these files properly when run from the command line, but
with htdig it doesn't. Can someone let me know what the problem is?

My htdig.conf file is
----------------------------
database_dir: /var/lib/htdig
start_url: http://MySite/PostNuke/html/Downloads/
limit_urls_to: ${start_url}
exclude_urls: /cgi-bin/ .cgi C=D C=M C=N C=S O=A O=D
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
 .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
maintainer: [EMAIL PROTECTED]
max_head_length: 10000
max_doc_size: 1000000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
external_parsers: application/rtf->text/html /var/www/html/doc2html/doc2html.pl \
 text/rtf->text/html /var/www/html/doc2html/doc2html.pl \
 application/pdf->text/html /var/www/html/doc2html/doc2html.pl \
 application/postscript->text/html /var/www/html/doc2html/doc2html.pl \
 application/msword->text/html /var/www/html/doc2html/doc2html.pl \
 application/msexcel->text/html /var/www/html/doc2html/doc2html.pl \
 application/vnd.ms-excel->text/html /var/www/html/doc2html/doc2html.pl \
 application/vnd.ms-powerpoint->text/html /var/www/html/doc2html/doc2html.pl \
----------------------------

Output of $htdig -vvvv
0:1:http://mysite/PostNuke/html/Downloads/
New server: mysite, 80
Retrieval command for http://mysite/robots.txt: GET /robots.txt HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
Host: mysite^M
^M
Header line: HTTP/1.1 404 Not Found
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Vary: accept-language
Header line: Accept-Ranges: bytes
Header line: Content-Length: 1066
Header line: Connection: close
Header line: Content-Type: text/html; charset=ISO-8859-1
Header line: Expires: Thu, 15 Apr 2004 10:19:51 GMT
Header line:
returnStatus = 1
pushed
0:1:http://mysite/PostNuke/html/Downloads/Test.pdf pushed
1:1:http://mysite/PostNuke/html/Downloads/ skipped
pick: mysite, # servers = 1
0:2:0:http://mysite/PostNuke/html/Downloads/: Retrieval command for http://mysite/PostNuke/html/Downloads/: GET /PostNuke/html/Downloads/ HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
If-Modified-Since: Thu, 15 Apr 2004 10:19:34 GMT^M
Host: mysite^M
^M
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Content-Length: 736
Header line: Connection: close
Header line: Content-Type: text/html; charset=ISO-8859-1
Header line:
returnStatus = 0
Read 736 from document
Read a total of 736 bytes
(changed)
Tag: <html>, matched -1
Tag: <head>, matched -1
Tag: <title>, matched 0
word: [EMAIL PROTECTED]
word: PostNuke/html/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: PostNuke/[EMAIL PROTECTED]
word part: html/[EMAIL PROTECTED]
Tag: </title>, matched 1
title: Index of /PostNuke/html/Downloads
Tag: </head>, matched -1
Tag: <body>, matched -1
Tag: <h1>, matched 4
word: [EMAIL PROTECTED]
word: PostNuke/html/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: PostNuke/[EMAIL PROTECTED]
word part: html/[EMAIL PROTECTED]
Tag: </h1>, matched 10
Tag: <pre>, matched -1
Tag: <img src="" alt="Icon " />, matched 18
word: [EMAIL PROTECTED]
image: http://mysite/icons/blank.gif
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=N&O=D (Name)
Rejected: Item in the exclude list: item # 5 length: 3
url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=N&O=D
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=M&O=A (Last modified)
Rejected: Item in the exclude list: item # 4 length: 3
url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=M&O=A
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=S&O=A (Size)
Rejected: Item in the exclude list: item # 6 length: 3
url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=S&O=A
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=D&O=A (Description)
Rejected: Item in the exclude list: item # 3 length: 3
url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=D&O=A
Tag: <hr />, matched -1
Tag: <img src="" alt="[DIR]" />, matched 18
word: [EMAIL PROTECTED]
image: http://mysite/icons/back.gif
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/ (Parent Directory)
Rejected: URL not in the limits!
url rejected: (level 1)http://mysite/PostNuke/html/
Tag: <img src="" alt="[ ]" />, matched 18
image: http://mysite/icons/layout.gif
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/Test.pdf (Test.pdf)
resolving 'http://mysite/PostNuke/html/Downloads/Test.pdf'
*word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: <hr />, matched -1
Tag: </pre>, matched -1
Tag: <address>, matched -1
word: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </address>, matched -1
Tag: </body>, matched -1
Tag: </html>, matched -1
size = 736
pick: mysite, # servers = 1
1:3:1:http://mysite/PostNuke/html/Downloads/Test.pdf: Retrieval command for http://mysite/PostNuke/html/Downloads/Test.pdf: GET /PostNuke/html/Downloads/Test.pdf HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
If-Modified-Since: Thu, 15 Apr 2004 08:14:43 GMT^M
Host: mysite^M
^M
Header line: HTTP/1.1 304 Not Modified
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Connection: close
Header line: ETag: "4e244-25b54-aff3c2c0"
Header line:
returnStatus = 2
not changed
pick: mysite, # servers = 1
-----------------------------------

A part of the $rundig -vvvv output
--------------------
Read 8192 from document
Read 6996 from document
Read a total of 154452 bytes // The file size is correct
PDF::setContents(154452 bytes)
PDF::parse(http://172.17.127.60/PostNuke/html/Downloads/Test.pdf)
PDF::parseNonTextLine: title is "Capability_4_1_June2002.PDF"
. . .
title: Capability_4_1_June2002.PDF
PDF::parseNonTextLine: total pages is 14
PDF::parseNonTextLine: start page 1
PDF::parseNonTextLine: begin text block
PDF::parseTextLine("297.59999 732.23999 TD") cmd=TD
PDF::parseTextLine("0 0 0 rg") cmd=rg
PDF::parseTextLine("/N6 28.07998 Tf") cmd=Tf
. . .
PDF::parseTextLine("0 Tc") cmd=Tc
PDF::parseTextLine("0.13198 Tw") cmd=Tw
PDF::parseTextLine("( )Tj ") cmd=
PDF::parseTextLine("(EXECUTIVE SUMMARY)Tj ") cmd=
PDF::parseTextLine("114.95999 0 TD") cmd=TD
PDF::parseTextLine("-0.00479 Tc") cmd=Tc
PDF::parseTextLine("0 Tw") cmd=Tw
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(..)Tj ") cmd=
. . . .
PDF::parseTextLine("ET") cmd=ET
PDF::parse: head = ""
PDF::parse: 83919 lines parsed
PDF::parse ends normally
size = 154452
pick: 172.17.127.60, # servers = 1
htmerge: Sorting...
htmerge: Merging...
0/http://mysite/PostNuke/html/Downloads/
Deleted, no excerpt: 1/http://mysite/PostNuke/html/Downloads/Test.pdf

Thanks,
Neha Verma