Hello,
I am using htdig 3.1.5 on Red Hat 7.0 and am having problems indexing PDF
files. I have included my config and rundig -vv output below. I have no
robots.txt file, my max_doc_size is now 10M (one test .pdf file is only 27K
and it also fails), and .pdf is not listed in bad_extensions.
I am using the latest xpdf with pdftotext, as well as the latest parse_doc
and conv_doc scripts.
I can run pdftotext manually on the pdf files and they do contain real text,
not just images. I can also run parse_doc and conv_doc.pl; they produce
proper text. When I do a rundig, I get a 'URL rejected' message and I do not
know why. This (I presume) leads to a Deleted No Excerpt message, and the
file (or any pdf file) is not indexed. Any suggestions?
Regards,
Tony
Below is my config:
external_parsers: application/msword /usr/bin/parse_doc.pl \
application/postscript /usr/bin/parse_doc.pl \
application/pdf /usr/bin/parse_doc.pl
database_dir: /data/software/htdigdb
local_urls: http://80.1.1.4/=/var/www/html/
start_url: http://80.1.1.4/htdig/
limit_urls_to: ${start_url}
exclude_urls: /cgi-bin/ .cgi
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .iso\
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
maintainer: [EMAIL PROTECTED]
max_head_length:5
max_doc_size: 1000
no_excerpt_show_top:true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
no_next_page_text:
no_prev_page_text:
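
To help narrow down which setting could be producing the rejection, here is a rough Python sketch of how the three URL filters in the config above might be applied to a link. It is only an illustration under stated assumptions: limit_urls_to and exclude_urls are treated as substring matches, bad_extensions as a suffix match on the URL path, and the function name reject_reason is hypothetical. It is not htdig's actual code.

```python
# Illustrative only: a simplified model of the URL filters, NOT htdig's code.
# Assumptions: substring match for limit_urls_to / exclude_urls,
# suffix match on the URL path for bad_extensions.
from urllib.parse import urlsplit

# Values copied from the config above.
LIMIT_URLS_TO = ["http://80.1.1.4/htdig/"]      # limit_urls_to: ${start_url}
EXCLUDE_URLS = ["/cgi-bin/", ".cgi"]            # exclude_urls
BAD_EXTENSIONS = [
    ".wav", ".gz", ".z", ".sit", ".au", ".zip", ".tar", ".hqx", ".exe",
    ".com", ".gif", ".iso", ".jpg", ".jpeg", ".aiff", ".class", ".map",
    ".ram", ".tgz", ".bin", ".rpm", ".mpg", ".mov", ".avi",
]


def reject_reason(url):
    """Return a short reason if the URL would be filtered out, else None."""
    lowered = url.lower()
    # The URL must match at least one limit pattern.
    if not any(pattern.lower() in lowered for pattern in LIMIT_URLS_TO):
        return "not matched by limit_urls_to"
    # The URL must not contain any exclude pattern.
    if any(pattern in url for pattern in EXCLUDE_URLS):
        return "matched exclude_urls"
    # The path (query string ignored) must not end in a bad extension.
    path = urlsplit(url).path.lower()
    for ext in BAD_EXTENSIONS:
        if path.endswith(ext):
            return "bad extension " + ext
    return None


if __name__ == "__main__":
    for candidate in (
        "http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf",
        "http://80.1.1.4/htdig/logo.gif",
        "http://80.1.1.4/htdig/cgi-bin/test",
    ):
        print(candidate, "->", reject_reason(candidate) or "accepted")
```

Under these assumptions the .pdf URLs pass all three filters, so a check like this can at least rule the listed patterns in or out for a given link.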
Below is the output of rundig -vv using 2 pdf files and 1 txt file:
New server: 80.1.1.4, 80
Trying local files
tried local file /var/www/html/robots.txt
Local retrieval failed, trying HTTP
pick: 80.1.1.4, # servers = 1
0:0:0:http://80.1.1.4/htdig/mx59pro/manual/english/: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/index.html
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=D"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?N=D
+A tag: pos = 2, position = ="?M=A"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?M=A
+A tag: pos = 2, position = ="?S=A"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?S=A
+A tag: pos = 2, position = ="?D=A"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?D=A
+A tag: pos = 2, position = ="/htdig/mx59pro/manual/"
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf
+A tag: pos = 2, position = ="content.txt"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/content.txt
+A tag: pos = 2, position = ="sonic.pdf"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf
+ size = 954
pick: 80.1.1.4, # servers = 1
1:1:1:http://80.1.1.4/htdig/mx59pro/manual/english/?N=D: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?N=D
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?N=A
+A tag: pos = 2, position = ="?M=A"
*A tag: pos = 2, position = ="?S=A"
*A tag: pos = 2, position = ="?D=A"
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/"
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="sonic.pdf"
*A tag: pos = 2, position = ="content.txt"
*A tag: pos = 2, position = ="content.pdf"
* size = 954
pick: 80.1.1.4, # servers = 1
2:2:1:http://80.1.1.4/htdig/mx59pro/manual/english/?M=A: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?M=A
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A"
*A tag: pos = 2, position = ="?M=D"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?M=D
+A tag: pos = 2, position = ="?S=A"
*A tag: pos = 2, position = ="?D=A"
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/"
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf"
*A tag: pos = 2, position = ="sonic.pdf"
*A tag: pos = 2, position = ="content.txt"
* size = 954
pick: 80.1.1.4, # servers = 1
3:3:1:http://80.1.1.4/htdig/mx59pro/manual/english/?S=A: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?S=A
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A"
*A tag: pos = 2, position = ="?M=A"
*A tag: pos = 2, position = ="?S=D"
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?S=D
+A tag: pos = 2, position = ="?D=A"
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/"
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.txt"
*A tag: pos = 2, position = ="content.pdf"
*A