On 2 Jan 2002 at 14:45, Gilles Detillieux wrote:
> According to Dan Langille:
> > I've found an instance where a document listed in robots.txt is
> > included in the final index. Not sure if this is a bug or a feature.
>
> It's a bug, but it's in your script...
>
> ...
> > $ more rundig.merge
> > #!/bin/sh
> ...
> > $BINDIR/htdig -vvv -c ${CONFIGMERGE}
> >
> > $BINDIR/htmerge -vvv -c ${CONFIG} -m ${CONFIGMERGE}
>
> This is the problem. As I've mentioned many times on this list before,
> you can't go straight from htdig to htmerge -m. You need to run htmerge
> in the standard way on the database from htdig before running htmerge -m.
[snip]
> You must insert
>
> $BINDIR/htmerge -vvv -c ${CONFIGMERGE}
>
> in your script after running htdig, and before running htmerge -m, to
> properly clean up the CONFIGMERGE database before merging it into your
> main one.
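For anyone following along, the corrected ordering can be sketched roughly as below. This is only an illustration, not my actual script: the default values for BINDIR, CONFIG and CONFIGMERGE are placeholders, and the run() wrapper, merge_steps() function and DRY_RUN switch are hypothetical helpers I added so the sequence can be printed without touching any database.

```shell
#!/bin/sh
# Sketch of the corrected ordering. The defaults below are placeholders
# (assumptions), not the real paths or config names from my script.
BINDIR=${BINDIR:-/usr/local/bin}
CONFIG=${CONFIG:-main.conf}
CONFIGMERGE=${CONFIGMERGE:-merge.conf}

# Hypothetical switch: with DRY_RUN=1 (the default here) the commands
# are only printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

merge_steps() {
    # 1. Dig the documents into the merge database.
    run "$BINDIR/htdig" -vvv -c "$CONFIGMERGE"
    # 2. The step I was missing: plain htmerge on the merge database.
    run "$BINDIR/htmerge" -vvv -c "$CONFIGMERGE"
    # 3. Only now merge it into the main database.
    run "$BINDIR/htmerge" -vvv -c "$CONFIG" -m "$CONFIGMERGE"
}

merge_steps
```

Setting DRY_RUN=0 would run the three commands for real, in that order.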
That fixed the problem. Thanks. For those trying the same thing, here
is what to expect:
$ sudo sh index.merge.sh
1:0:http://diary.unixathome.org/ottawa-pics.php
New server: diary.unixathome.org, 80
Retrieval command for http://diary.unixathome.org/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: diary.unixathome.org
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 02 Jan 2002 20:56:56 GMT
Header line: Server: Apache/1.3.22 (Unix) mod_throttle/3.1.2 mod_gzip/1.3.19.1a PHP/4.0.6
Header line: Last-Modified: Wed, 02 Jan 2002 20:06:14 GMT
Translated Wed, 02 Jan 2002 20:06:14 GMT to 2002-01-02 20:06:14 (102)
And converted to Wed, 02 Jan 2002 20:06:14
Header line: ETag: "cd75e-59-3c336836"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 89
Header line: Connection: close
Header line: Content-Type: text/plain
Header line:
returnStatus = 0
Read 89 from document
Read a total of 89 bytes
Parsing robots.txt file using myname = htdig
Robots.txt line: # robots.txt for http://www.freebsddiary.org/
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow: /ottawa-pics.php
Found 'disallow' line: /ottawa-pics.php
Pattern: /ottawa-pics.php
pushedrobots.txt: discarding 'http://diary.unixathome.org/ottawa-pics.php', which = 0, length = 16
pick: diary.unixathome.org, # servers = 1
htmerge: Unable to open word list file '/usr/local/share/htdig/databases/freebsddiary/merge/db.wordlist'
htmerge: Unable to open word list file '/usr/local/share/htdig/databases/freebsddiary/merge/db.wordlist'
done merge
[dan@xeon:/home/freebsddiary/configuration/htdig] $
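I'm quite sure that 'Unable to open word list file' is the result of
having no documents to merge into the master database: the only URL
was discarded because of robots.txt, so presumably htdig never wrote
a word list for the merge database.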
> I think that if I can't easily fix htmerge -m to deal with control
> records, I'll have to put a really big warning in the htmerge.html
> manual page about this.
I think that's a great idea. While you're at it: I found the htmerge -m
option confusing at first, because it wasn't clear which configuration
file names the source database and which names the destination. Perhaps
wording it like this would be better:
    -m merge_configfile
        Merge the databases specified by merge_configfile into the
        databases specified by -c or the default.
Thanks.
--
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples