Re: [htdig] premature merging

campbel Fri, 11 Aug 2000 12:04:23 -0700

According to Geoff Hutchison:
> On Fri, 11 Aug 2000 [EMAIL PROTECTED] wrote:
> > syntax in the config files so I know that it isn't that. I'm not sure
> > if it makes a difference but these start URL's all contain /cgi-bin/ and the
> 
> I'd make sure you've set the exclude_urls appropriately. Remember that the
> default is to exclude cgi-bin.

My exclude_urls is set to .gif

>Also check limit_urls_to.  By default, it takes on the value of start_url,
>which won't do if you list very specific URLs in this parameter, because
>your limit_urls_to won't be open-ended enough to allow other URLs.

As an example, all of the URLs in my start_url look similar to

http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e?bcb_bcb3-00_78

except that the part after the ? varies.

Each of those pages links to several URLs that look like

http://different.server.ca/cgi-bin/blah/blah/blah/ViewDoc?journal=one&volume=2&file=3.pdf

where the query string after the ? varies.

My limit_urls_to attribute is set to
http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e? \
http://different.server.ca/cgi-bin/blah/blah/blah/RPViewDoc

so I can't see a problem with that. The strange thing is that htdig
goes through only about 15 of the 50 start_url URLs and then merges.
It seems that htdig thinks it has finished digging, and I can't
pinpoint why.
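
One way to double-check the patterns above is to test them outside of
htdig. The Python sketch below is only a rough model, not htdig's own
code: it assumes that a URL is kept when it contains at least one
limit_urls_to pattern and none of the exclude_urls patterns, using plain
substring tests. The names url_allowed, LIMIT_URLS_TO, EXCLUDE_URLS,
SAMPLE_URLS and report are hypothetical; the values are the placeholder
ones from this message.

```python
# Hypothetical model of URL filtering; this is NOT htdig's actual code.
# Assumption: a URL is kept only if it contains at least one limit pattern
# and contains none of the exclude patterns (plain substring tests).

def url_allowed(url, limit_patterns, exclude_patterns):
    """Return True if url passes both filters under the substring assumption."""
    if any(pattern in url for pattern in exclude_patterns):
        return False
    return any(pattern in url for pattern in limit_patterns)


# Placeholder values copied from the message.
LIMIT_URLS_TO = [
    "http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e?",
    "http://different.server.ca/cgi-bin/blah/blah/blah/RPViewDoc",
]
EXCLUDE_URLS = [".gif"]

SAMPLE_URLS = [
    "http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e?bcb_bcb3-00_78",
    "http://different.server.ca/cgi-bin/blah/blah/blah/"
    "ViewDoc?journal=one&volume=2&file=3.pdf",
]


def report(urls=SAMPLE_URLS):
    """Pair each URL with the filter decision for the values above."""
    return [(url, url_allowed(url, LIMIT_URLS_TO, EXCLUDE_URLS)) for url in urls]


if __name__ == "__main__":
    for url, ok in report():
        print("keep" if ok else "drop", url)
```

Under that assumption, the sample start URL is kept but the sample
linked URL is dropped, because the link contains ViewDoc while the
pattern says RPViewDoc. That difference may only come from the
placeholder names used here, but it would be worth comparing the real
pattern against a real link in the same way.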

>So one way to get more information on this
>is to run htdig by itself and add the -vvvv flag for more debugging
>information.

I ran the dig with -vvv and the output seemed fine: it was following
all links, indexing the PDFs, and parsing them perfectly.

I'm stumped,
Sheri
