I've found an instance where a document disallowed in robots.txt is
included in the final index. I'm not sure if this is a bug or a feature.
Here's the verbose output from a run of my modified rundig script:
$ sudo ./rundig.merge
0:0:http://diary.unixathome.org/ottawa-pics.php
New server: diary.unixathome.org, 80
Retrieval command for http://diary.unixathome.org/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: diary.unixathome.org
Header line: HTTP/1.1 200 OK
Header line: Date: Tue, 01 Jan 2002 23:17:56 GMT
Header line: Server: Apache/1.3.22 (Unix) mod_throttle/3.1.2 mod_gzip/1.3.19.1a PHP/4.0.6
Header line: Last-Modified: Tue, 01 Jan 2002 22:57:30 GMT
Translated Tue, 01 Jan 2002 22:57:30 GMT to 2002-01-01 22:57:30 (102)
And converted to Tue, 01 Jan 2002 22:57:30
Header line: ETag: "cd75e-59-3c323eda"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 89
Header line: Connection: close
Header line: Content-Type: text/plain
Header line:
returnStatus = 0
Read 89 from document
Read a total of 89 bytes
Parsing robots.txt file using myname = htdig
Robots.txt line: # robots.txt for http://www.freebsddiary.org/
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow: /ottawa-pics.php
Found 'disallow' line: /ottawa-pics.php
Pattern: /ottawa-pics.php
pushedrobots.txt: discarding 'http://diary.unixathome.org/ottawa-pics.php', which = 0, length = 16
1:0:http://diary.unixathome.org/ottawa-pics.php skipped
pick: diary.unixathome.org, # servers = 1
htmerge: Merged URL: http://diary.unixathome.org/ottawa-pics.php
htmerge: Sorting...
htmerge: Merging...
htmerge: 100:02securityht
htmerge: 200:0targz
htmerge: 300:0x378
htmerge: 400:100mph
etc...
653/http://diary.unixathome.org/ottawa-pics.php
etc...
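
For anyone wanting to check their own logs for the same thing, something
like the following might help. It is only a sketch: it assumes the verbose
output was saved to a file with each message on a single line (no mail
wrapping), and that the "discarding" and "Merged URL:" lines keep the form
shown above. The function name find_leaked_urls is made up for this
example; it is not part of ht://Dig.

```shell
#!/bin/sh
# Sketch: list URLs that htdig discarded because of robots.txt but that
# htmerge still reported as merged. find_leaked_urls is a hypothetical
# helper, not an ht://Dig tool.
find_leaked_urls() {
    log=$1
    # Pull out the URL between single quotes on each "discarding" line.
    sed -n "s/.*robots\.txt: discarding '\([^']*\)'.*/\1/p" "$log" | sort -u |
    while read -r url
    do
        # Fixed-string, whole-line match against htmerge's report.
        if grep -qxF "htmerge: Merged URL: $url" "$log"
        then
            echo "$url"
        fi
    done
}
```

Running find_leaked_urls on the saved log prints one URL per line for
every page that was skipped by the dig yet still shows up in the merge.
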
$ more rundig.merge
#!/bin/sh
#
# rundig
#
# $Id: rundig,v 1.1 2002/01/01 18:49:45 dan Exp $
#
# This is a sample script to create a search database for ht://Dig.
#
DBDIR=/usr/local/share/htdig/databases/freebsddiary/merge
CONFIG=/home/freebsddiary/configuration/htdig/htdig-freebsddiary.org.conf
CONFIGMERGE=/home/freebsddiary/configuration/htdig/htdig-freebsddiary.org.merge.conf
COMMONDIR=/usr/local/share/htdig
BINDIR=/usr/local/bin
#
# Set the TMPDIR variable if you want htmerge to put files in a location
# other than the default. This is important if you do not have enough
# disk space for the big sort that htmerge runs. Also, be aware that
# on some systems, /tmp is a memory mapped filesystem that takes away
# from virtual memory.
#
TMPDIR=$DBDIR
export TMPDIR
$BINDIR/htdig -vvv -c ${CONFIGMERGE}
$BINDIR/htmerge -vvv -c ${CONFIG} -m ${CONFIGMERGE}
echo "done merge"
case "$alt" in
-a)
( cd $DBDIR && test -f db.docdb.work &&
for f in *.work
do
echo "moving $f to " . `basename $f .work`;
mv -f $f `basename $f .work`
done ) ;;
esac
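
Note that nothing in the script as shown assigns $alt, so the case block
above only does anything if alt happens to come from the environment. A
minimal sketch of setting it from the command line, assuming -a is meant
to be passed as an argument (parse_args is a made-up name, and only -a is
handled here):

```shell
#!/bin/sh
# Sketch: set $alt from the command line so the "-a" branch can run.
# parse_args is a hypothetical helper, not part of the script above.
parse_args() {
    alt=
    for arg in "$@"
    do
        case "$arg" in
        -a) alt="$arg" ;;
        esac
    done
}

parse_args "$@"
```

With this placed near the top of the script, running it with -a would
leave alt set to -a, and running it without arguments would leave alt
empty.
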
--
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples