Re: [htdig] Options to htdig

2000-12-30 Thread Geoff Hutchison

On Fri, 29 Dec 2000, Douglas Kline wrote:

> I've been running some tests and my results don't seem consistent with your
> description.

I think a better way of saying this is "my description wasn't very good."

Let me put it this way:
-a = "add .work to the database names before doing anything"
-i = "delete files before starting"

The -a flag is performed before the -i. So yes, if you use -a -i, it will
create new files with the .work extension--but if there were already .work
files existing, it would delete them before doing anything.
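To make the ordering concrete, here is a minimal sketch in Python that models the two flags as described above. It is not htdig's actual code: the function name `resolve_db_files` and the two base names in `BASE_NAMES` are illustrative assumptions.

```python
import os
import tempfile

# Illustrative base names only; the real list depends on the htdig version.
BASE_NAMES = ["db.docdb", "db.wordlist"]


def resolve_db_files(db_dir, use_alt, initial):
    """Model the order described above: apply -a first, then -i."""
    # -a: add ".work" to the database names before doing anything else
    suffix = ".work" if use_alt else ""
    paths = [os.path.join(db_dir, name + suffix) for name in BASE_NAMES]

    # -i: delete whichever files those names now point at
    if initial:
        for path in paths:
            if os.path.exists(path):
                os.remove(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as db_dir:
        # Pretend an earlier run left both live and .work files behind
        for name in BASE_NAMES:
            for suffix in ("", ".work"):
                open(os.path.join(db_dir, name + suffix), "w").close()

        resolve_db_files(db_dir, use_alt=True, initial=True)
        print(sorted(os.listdir(db_dir)))
```

In this model, -a -i removes only the stale .work copies; the live files without the extension are left alone.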

> If you don't use "-i", then how does htdig use an old database?

If the database (with or without the .work extension) exists and you don't
supply the -i, htdig will read in all the URLs from the database before
starting. Before retrieving the URL from the server, it will send a header
telling the server to only send the document if it's changed. It will also
check the date of the document and only index it if it's newer than what's
in the DB already.
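The update logic can be sketched roughly as follows. This is a Python illustration, not htdig's source; it assumes the header in question is the standard HTTP If-Modified-Since, and the function names `conditional_headers` and `should_reindex` are made up for the example.

```python
from email.utils import formatdate


def conditional_headers(stored_mtime):
    """Build the request header asking for the document only if it changed."""
    if stored_mtime is None:
        return {}  # URL not in the database yet: plain request
    # HTTP dates are always expressed in GMT
    return {"If-Modified-Since": formatdate(stored_mtime, usegmt=True)}


def should_reindex(status, doc_mtime, stored_mtime):
    """Decide whether a fetched document needs to be indexed again."""
    if status == 304:
        return False  # server reports the document is unchanged
    if stored_mtime is None or doc_mtime is None:
        return True  # nothing to compare against, so index it
    # Only index if the document is newer than what the DB already holds
    return doc_mtime > stored_mtime


if __name__ == "__main__":
    print(conditional_headers(0))
    print(should_reindex(200, doc_mtime=200, stored_mtime=100))
```

Timestamps here are seconds since the epoch; the second check covers servers that ignore the conditional header and send the document anyway.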

> rundig will indeed rename the ".work" files to their basenames but
> only if "-a" is given as an option on the command line.  If rundig is
> edited to put the "-a" flag on the htdig command line within the
> script, then the "-a" won't become part of the variable alt and the
> script won't execute the lines which rename the files.

IMHO, the rundig script is pretty well commented--I didn't think it needed
a whole lot of explanation that the files will only be renamed if $alt is
set. Just my $0.02
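For anyone following along, the behaviour under discussion can be modelled like this. It is a Python paraphrase for illustration only, not the rundig shell script itself; the function name `rundig_rename` is hypothetical.

```python
import os
import tempfile


def rundig_rename(db_dir, argv):
    """Rename *.work files to their base names, but only if -a was passed."""
    # alt is set from the script's own command line, so a -a hard-coded
    # on the htdig call further down in the script never reaches it.
    alt = "-a" if "-a" in argv else ""
    renamed = []
    if alt:
        for name in sorted(os.listdir(db_dir)):
            if name.endswith(".work"):
                target = name[: -len(".work")]
                os.replace(os.path.join(db_dir, name),
                           os.path.join(db_dir, target))
                renamed.append(target)
    return renamed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as db_dir:
        open(os.path.join(db_dir, "db.docdb.work"), "w").close()
        print(rundig_rename(db_dir, []))      # no -a: nothing is renamed
        print(rundig_rename(db_dir, ["-a"]))  # -a: .work file is renamed
```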

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:            http://www.htdig.org/FAQ.html




[htdig] PDF problems

2000-12-30 Thread The Melia Family

Hello,

I am using HTDIG 3.1.5 on Redhat 7.0 and am having problems indexing PDF
files. I have included my config and -vv output below. I have no robots.txt
file, my max_doc_size is now 10M (one test .pdf file is only 27K and it
also fails), and I am not rejecting .pdf as an extension.
I am using the latest xpdf with pdftotext, as well as the latest parse_doc
and conv_doc scripts.

I can manually run pdftotext on the PDF files, and they do contain real text,
not just images. I can also run parse_doc and conv_doc.pl by hand; they produce
proper text. When I do a rundig, I get a 'URL rejected' message, and I do not
know why. This (I presume) leads to a "Deleted, no excerpt" message, and the
file (or any PDF file) is not indexed. Any suggestions?

Regards,
Tony

___ Below is my config ___

external_parsers: application/msword /usr/bin/parse_doc.pl \
  application/postscript /usr/bin/parse_doc.pl \
  application/pdf /usr/bin/parse_doc.pl

database_dir:   /data/software/htdigdb

local_urls:  http://80.1.1.4/=/var/www/html/

start_url:  http://80.1.1.4/htdig/

limit_urls_to:  ${start_url}

exclude_urls:   /cgi-bin/ .cgi

bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .iso \
                .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi

maintainer: [EMAIL PROTECTED]

max_head_length:5

max_doc_size:   1000

no_excerpt_show_top:true

search_algorithm:   exact:1 synonyms:0.5 endings:0.1


no_next_page_text:
no_prev_page_text:
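
As a rough way to reason about which of the URL filters in this config could reject a link, here is a small Python sketch. It assumes limit_urls_to and exclude_urls are plain substring matches and bad_extensions is a suffix match on the path; the real matching rules and check order in htdig may differ, and the function name `rejection_reason` is made up for the example.

```python
from urllib.parse import urlsplit

# Values copied from the config above
LIMIT_URLS_TO = ["http://80.1.1.4/htdig/"]
EXCLUDE_URLS = ["/cgi-bin/", ".cgi"]
BAD_EXTENSIONS = [
    ".wav", ".gz", ".z", ".sit", ".au", ".zip", ".tar", ".hqx", ".exe",
    ".com", ".gif", ".iso", ".jpg", ".jpeg", ".aiff", ".class", ".map",
    ".ram", ".tgz", ".bin", ".rpm", ".mpg", ".mov", ".avi",
]


def rejection_reason(url):
    """Return why a URL would be filtered out, or None if it passes."""
    # Assumption: the URL must contain at least one limit_urls_to pattern
    if not any(pattern in url for pattern in LIMIT_URLS_TO):
        return "outside limit_urls_to"
    # Assumption: any exclude_urls pattern appearing in the URL rejects it
    if any(pattern in url for pattern in EXCLUDE_URLS):
        return "matches exclude_urls"
    # Assumption: bad_extensions is compared against the end of the path
    path = urlsplit(url).path.lower()
    if any(path.endswith(ext) for ext in BAD_EXTENSIONS):
        return "bad extension"
    return None


if __name__ == "__main__":
    base = "http://80.1.1.4/htdig/mx59pro/manual/english/"
    for name in ("content.pdf", "sonic.pdf", "content.txt"):
        print(name, rejection_reason(base + name))
```

Under these assumptions the PDF URLs pass all three filters, which matches the "pushing" lines for them in the -vv output.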

___ Below is the output of rundig -vv using 2 pdf files and 1 txt file ___

New server: 80.1.1.4, 80
Trying local files
  tried local file /var/www/html/robots.txt
Local retrieval failed, trying HTTP
pick: 80.1.1.4, # servers = 1
0:0:0:http://80.1.1.4/htdig/mx59pro/manual/english/: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/index.html
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=D"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?N=D
+A tag: pos = 2, position = ="?M=A"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?M=A
+A tag: pos = 2, position = ="?S=A"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?S=A
+A tag: pos = 2, position = ="?D=A"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?D=A
+A tag: pos = 2, position = ="/htdig/mx59pro/manual/"

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf
+A tag: pos = 2, position = ="content.txt"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/content.txt
+A tag: pos = 2, position = ="sonic.pdf"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf
+ size = 954
pick: 80.1.1.4, # servers = 1
1:1:1:http://80.1.1.4/htdig/mx59pro/manual/english/?N=D: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?N=D
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?N=A
+A tag: pos = 2, position = ="?M=A"
*A tag: pos = 2, position = ="?S=A"
*A tag: pos = 2, position = ="?D=A"
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/"

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="sonic.pdf"
*A tag: pos = 2, position = ="content.txt"
*A tag: pos = 2, position = ="content.pdf"
* size = 954
pick: 80.1.1.4, # servers = 1
2:2:1:http://80.1.1.4/htdig/mx59pro/manual/english/?M=A: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?M=A
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A"
*A tag: pos = 2, position = ="?M=D"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?M=D
+A tag: pos = 2, position = ="?S=A"
*A tag: pos = 2, position = ="?D=A"
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/"

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf"
*A tag: pos = 2, position = ="sonic.pdf"
*A tag: pos = 2, position = ="content.txt"
* size = 954
pick: 80.1.1.4, # servers = 1
3:3:1:http://80.1.1.4/htdig/mx59pro/manual/english/?S=A: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?S=A
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A"
*A tag: pos = 2, position = ="?M=A"
*A tag: pos = 2, position = ="?S=D"

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?S=D
+A tag: pos = 2, position = ="?D=A"
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/"

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.txt"
*A tag: pos = 2, position = ="content.pdf"
*A