On 2 Jan 2002 at 14:45, Gilles Detillieux wrote:

> According to Dan Langille:
> > I've found an instance where a document listed in robots.txt is
> > included in the final index.  Not sure if this is a bug or a feature.
> 
> It's a bug, but it's in your script...
> 
> ...
> > $ more rundig.merge
> > #!/bin/sh
> ...
> > $BINDIR/htdig   -vvv  -c ${CONFIGMERGE}
> > 
> > $BINDIR/htmerge -vvv  -c ${CONFIG} -m ${CONFIGMERGE}
> 
> This is the problem.  As I've mentioned many times on this list before,
> you can't go straight from htdig to htmerge -m.  You need to run htmerge
> in the standard way on the database from htdig before running htmerge -m.

[snip]

> You must insert
> 
>   $BINDIR/htmerge -vvv  -c ${CONFIGMERGE}
> 
> in your script after running htdig, and before running htmerge -m, to
> properly clean up the CONFIGMERGE database before merging it into your
> main one.

That fixed the problem.  Thanks.  For those trying the same thing, here 
is what to expect:

$ sudo sh index.merge.sh
        1:0:http://diary.unixathome.org/ottawa-pics.php
New server: diary.unixathome.org, 80
Retrieval command for http://diary.unixathome.org/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: diary.unixathome.org

Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 02 Jan 2002 20:56:56 GMT
Header line: Server: Apache/1.3.22 (Unix) mod_throttle/3.1.2 mod_gzip/1.3.19.1a PHP/4.0.6
Header line: Last-Modified: Wed, 02 Jan 2002 20:06:14 GMT
Translated Wed, 02 Jan 2002 20:06:14 GMT to 2002-01-02 20:06:14 (102)
And converted to Wed, 02 Jan 2002 20:06:14
Header line: ETag: "cd75e-59-3c336836"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 89
Header line: Connection: close
Header line: Content-Type: text/plain
Header line:
returnStatus = 0
Read 89 from document
Read a total of 89 bytes
Parsing robots.txt file using myname = htdig
Robots.txt line: # robots.txt for http://www.freebsddiary.org/
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow: /ottawa-pics.php
Found 'disallow' line: /ottawa-pics.php
Pattern: /ottawa-pics.php
 pushedrobots.txt: discarding 'http://diary.unixathome.org/ottawa-pics.php', which = 0, length = 16

pick: diary.unixathome.org, # servers = 1
htmerge: Unable to open word list file '/usr/local/share/htdig/databases/freebsddiary/merge/db.wordlist'

htmerge: Unable to open word list file '/usr/local/share/htdig/databases/freebsddiary/merge/db.wordlist'

done merge
[dan@xeon:/home/freebsddiary/configuration/htdig] $

I'm quite sure that 'Unable to open word list file' appears because
robots.txt excluded the only URL in this run, so htdig had no documents
to merge into the master database.
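
For anyone adapting their own script, here is a minimal sketch of the
corrected ordering.  The three commands and their options are the ones
quoted above; the default values for BINDIR, CONFIG and CONFIGMERGE are
placeholders, and the run_merge function and the MERGE_WORDLIST check
are my own additions (assumptions, not something htdig requires).  The
check only skips the merge when htdig left no word list behind, which is
my guess at what produces the message shown above.

```shell
#!/bin/sh
# Placeholder defaults (assumptions); use the values from your own script.
BINDIR=${BINDIR:-/usr/local/bin}
CONFIG=${CONFIG:-/path/to/main.conf}
CONFIGMERGE=${CONFIGMERGE:-/path/to/merge.conf}
# Word list of the merge database; this path is taken from the log above.
MERGE_WORDLIST=${MERGE_WORDLIST:-/usr/local/share/htdig/databases/freebsddiary/merge/db.wordlist}

# run_merge is a hypothetical helper name, not part of ht://Dig.
run_merge() {
    # 1. Dig the new documents into the merge database.
    "$BINDIR/htdig" -vvv -c "$CONFIGMERGE" || return 1

    # Hypothetical guard: with no word list there is nothing to merge.
    if [ ! -s "$MERGE_WORDLIST" ]; then
        echo "nothing to merge: $MERGE_WORDLIST is missing or empty" >&2
        return 0
    fi

    # 2. Run htmerge in the standard way to clean up the merge database.
    "$BINDIR/htmerge" -vvv -c "$CONFIGMERGE" || return 1

    # 3. Only now merge it into the main database.
    "$BINDIR/htmerge" -vvv -c "$CONFIG" -m "$CONFIGMERGE"
}
```

Call run_merge at the end of the script.  If you would rather see the
htmerge message than skip silently, drop the guard and keep the three
commands in the same order.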

> I think that if I can't easily fix htmerge -m to deal with control
> records, I'll have to put a really big warning in the htmerge.html
> manual page about this.

I think that's a great idea.  While you're at it: I found the htmerge -m
option confusing at first, because it wasn't clear which configuration
file names the source databases and which names the destination.
Perhaps wording it like this would be better:

     -m merge_configfile
             Merge the databases specified by merge_configfile into the   
             databases specified by -c or the default.

Thanks.
-- 
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples

