The entries in your external_parsers: statement look OK, but the statement as a whole is being ignored by htdig.  Make sure that '\' is the very last character on every line except the last one, with nothing after it, not even a space.  Do not put a '\' on the last line.  Start every line but the first with at least one space.
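
As a rough sketch of what that means for your file, assuming the same doc2html.pl path you already use, the statement could look something like this. The first entry sits on the same line as the attribute name, each continuation line is indented, and there is no '\' after the last entry:

```
external_parsers: application/rtf->text/html /var/www/html/doc2html/doc2html.pl \
        text/rtf->text/html /var/www/html/doc2html/doc2html.pl \
        application/pdf->text/html /var/www/html/doc2html/doc2html.pl \
        application/postscript->text/html /var/www/html/doc2html/doc2html.pl \
        application/msword->text/html /var/www/html/doc2html/doc2html.pl \
        application/msexcel->text/html /var/www/html/doc2html/doc2html.pl \
        application/vnd.ms-excel->text/html /var/www/html/doc2html/doc2html.pl \
        application/vnd.ms-powerpoint->text/html /var/www/html/doc2html/doc2html.pl
```

Note also that the htdig -vvvv run quoted below got a 304 Not Modified for Test.pdf, so the file would need to be fetched again (for example by a full rundig) before a changed parser setting can have any effect.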
 
David Adams
Southampton University
----- Original Message -----
Sent: Thursday, April 15, 2004 11:28 AM
Subject: [htdig] PDF Contents not being parsed


Hi,
I am trying to index PDF documents, but htdig doesn't parse their contents. Searches return only the file name.

doc2html parses these files properly when run from the command line, but not when it is run through htdig. Can someone tell me what the problem is?

My htdig.conf file is
----------------------------

database_dir:           /var/lib/htdig
start_url:      http://MySite/PostNuke/html/Downloads/
limit_urls_to:          ${start_url}
exclude_urls:           /cgi-bin/ .cgi  C=D C=M C=N C=S O=A O=D
bad_extensions:         .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css

maintainer:     [EMAIL PROTECTED]
max_head_length:        10000
max_doc_size:           1000000
no_excerpt_show_top:    true
search_algorithm:       exact:1 synonyms:0.5 endings:0.1
external_parsers:
application/rtf->text/html /var/www/html/doc2html/doc2html.pl \
text/rtf->text/html /var/www/html/doc2html/doc2html.pl \
application/pdf->text/html /var/www/html/doc2html/doc2html.pl \
application/postscript->text/html /var/www/html/doc2html/doc2html.pl \
application/msword->text/html /var/www/html/doc2html/doc2html.pl \
application/msexcel->text/html /var/www/html/doc2html/doc2html.pl \
application/vnd.ms-excel->text/html /var/www/html/doc2html/doc2html.pl \
application/vnd.ms-powerpoint->text/html /var/www/html/doc2html/doc2html.pl \

----------------------------

Output of $htdig -vvvv
        0:1:http://mysite/PostNuke/html/Downloads/
New server: mysite, 80
Retrieval command for http://mysite/robots.txt: GET /robots.txt HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
Host: mysite^M
^M
Header line: HTTP/1.1 404 Not Found
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Vary: accept-language
Header line: Accept-Ranges: bytes
Header line: Content-Length: 1066
Header line: Connection: close
Header line: Content-Type: text/html; charset=ISO-8859-1
Header line: Expires: Thu, 15 Apr 2004 10:19:51 GMT
Header line:
returnStatus = 1
 pushed
        0:1:http://mysite/PostNuke/html/Downloads/Test.pdf pushed
        1:1:http://mysite/PostNuke/html/Downloads/ skipped
pick: mysite, # servers = 1
0:2:0:http://mysite/PostNuke/html/Downloads/: Retrieval command for http://mysite/PostNuke/html/Downloads/: GET /PostNuke/html/Downloads/ HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
If-Modified-Since: Thu, 15 Apr 2004 10:19:34 GMT^M
Host: mysite^M
^M
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Content-Length: 736
Header line: Connection: close
Header line: Content-Type: text/html; charset=ISO-8859-1
Header line:
returnStatus = 0
Read 736 from document
Read a total of 736 bytes
 (changed) Tag: <html>, matched -1
Tag: <head>, matched -1
Tag: <title>, matched 0
word: [EMAIL PROTECTED]
word: PostNuke/html/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: PostNuke/[EMAIL PROTECTED]
word part: html/[EMAIL PROTECTED]
Tag: </title>, matched 1

title: Index of /PostNuke/html/Downloads
Tag: </head>, matched -1
Tag: <body>, matched -1
Tag: <h1>, matched 4
word: [EMAIL PROTECTED]
word: PostNuke/html/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: PostNuke/[EMAIL PROTECTED]
word part: html/[EMAIL PROTECTED]
Tag: </h1>, matched 10
Tag: <pre>, matched -1
Tag: <img src="" alt="Icon " />, matched 18
word: [EMAIL PROTECTED]
image: http://mysite/icons/blank.gif
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=N&O=D (Name)

  Rejected: Item in the exclude list: item # 5 length: 3

url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=N&O=D
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=M&O=A (Last modified)

  Rejected: Item in the exclude list: item # 4 length: 3

url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=M&O=A
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=S&O=A (Size)

  Rejected: Item in the exclude list: item # 6 length: 3

url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=S&O=A
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=D&O=A (Description)

  Rejected: Item in the exclude list: item # 3 length: 3

url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=D&O=A
Tag: <hr />, matched -1
Tag: <img src="" alt="[DIR]" />, matched 18
word: [EMAIL PROTECTED]
image: http://mysite/icons/back.gif
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/ (Parent Directory)

   Rejected: URL not in the limits!
url rejected: (level 1)http://mysite/PostNuke/html/
Tag: <img src="" alt="[   ]" />, matched 18
image: http://mysite/icons/layout.gif
Tag: <a href="">, matched 2
word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/Test.pdf (Test.pdf)
resolving 'http://mysite/PostNuke/html/Downloads/Test.pdf'
*word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: <hr />, matched -1
Tag: </pre>, matched -1
Tag: <address>, matched -1
word: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </address>, matched -1
Tag: </body>, matched -1
Tag: </html>, matched -1
 size = 736
pick: mysite, # servers = 1
1:3:1:http://mysite/PostNuke/html/Downloads/Test.pdf: Retrieval command for http://mysite/PostNuke/html/Downloads/Test.pdf: GET /PostNuke/html/Downloads/Test.pdf HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
If-Modified-Since: Thu, 15 Apr 2004 08:14:43 GMT^M
Host: mysite^M
^M
Header line: HTTP/1.1 304 Not Modified
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Connection: close
Header line: ETag: "4e244-25b54-aff3c2c0"
Header line:
returnStatus = 2
 not changed
pick: mysite, # servers = 1
-----------------------------------
Part of the output of $rundig -vvvv
--------------------
Read 8192 from document
Read 6996 from document
Read a total of 154452 bytes                  // The file size is correct
PDF::setContents(154452 bytes)
PDF::parse(http://172.17.127.60/PostNuke/html/Downloads/Test.pdf)
PDF::parseNonTextLine: title is "Capability_4_1_June2002.PDF"
.
.
.
title: Capability_4_1_June2002.PDF
PDF::parseNonTextLine: total pages is 14
PDF::parseNonTextLine: start page 1
PDF::parseNonTextLine: begin text block
PDF::parseTextLine("297.59999 732.23999 TD") cmd=TD
PDF::parseTextLine("0 0 0 rg") cmd=rg
PDF::parseTextLine("/N6 28.07998 Tf") cmd=Tf
.
.
.
PDF::parseTextLine("0 Tc") cmd=Tc
PDF::parseTextLine("0.13198 Tw") cmd=Tw
PDF::parseTextLine("( )Tj ") cmd=
PDF::parseTextLine("(EXECUTIVE SUMMARY)Tj ") cmd=
PDF::parseTextLine("114.95999 0 TD") cmd=TD
PDF::parseTextLine("-0.00479 Tc") cmd=Tc
PDF::parseTextLine("0 Tw") cmd=Tw
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(..)Tj ") cmd=
.
.
.
.
PDF::parseTextLine("ET") cmd=ET
PDF::parse: head = ""
PDF::parse: 83919 lines parsed
PDF::parse ends normally
 size = 154452
pick: 172.17.127.60, # servers = 1
htmerge: Sorting...
htmerge: Merging...

0/http://mysite/PostNuke/html/Downloads/
Deleted, no excerpt: 1/http://mysite/PostNuke/html/Downloads/Test.pdf

Thanks,
Neha Verma
