According to Steven C. Williams:
> We go into a specific directory on my server, in this
> case:
> http://64.130.230.53/syndicate
> There are a bunch of php includes which we create from
> feeds from different sites around the web; as far as
> what htdig does with those everything is great. 
> 
> Now here's where we want to go:
> We're trying to get htdig to make some record of the
> title words and urls of the articles that are coming
> thru this feed. Right now with current configuration
> of htdig.conf, we get links just to the syndicated php
> files on our server. When we run rundig -vvv, we get
> the following output (excerpt for one site):
[snip]
> What we'd like to do is to have these URL rejects make
> it into the htdig database but not crawl onto the
> sites themselves. Is that possible?

Maybe, but it might be tricky.  If these external URLs are always
a fixed number of hops away from the start_url, it's pretty easy.
Just add the names of the other hosts to limit_urls_to, or leave it
wide open with a pattern like "limit_urls_to: http://", and set your
max_hop_count to prevent htdig from spidering down too deep.  If you're
running htdig-3.1.5, you'll probably need to install this patch to make
sure hop counts aren't corrupted, unless you never encounter a link to
a given external URL more than once:

    ftp://ftp.htdig.org/htdig-patches/3.1.5/hop_count.0
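
As a rough sketch of the fixed-hop-count case, the relevant part of
htdig.conf might look something like the following.  The start_url is
the one from your message; the max_hop_count value is only a placeholder
assumption, and you'd need to set it to however many links separate
your start_url from the external articles:

    # Sketch only; adjust the values to your own setup.
    start_url:      http://64.130.230.53/syndicate
    # Wide open: accept any http:// URL, on any host.
    limit_urls_to:  http://
    # Placeholder value (an assumption): stop this many links away
    # from start_url, so htdig indexes the external articles but
    # does not crawl any further into those sites.
    max_hop_count:  2

If you'd rather not leave it wide open, list the external hosts in
limit_urls_to instead of the bare "http://" pattern.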

If the number of hops to these external URLs is not consistent, I
can't think of an easy way, unless you can somehow break it down into
chunks that are consistent, indexing them separately and then merging
them together.

> Furthermore, would it be possible, then, to accumulate
> that information without the redundant URLs piling up
> in the database?

I'm not sure I follow you here.  If you can keep the spidering in check,
then you won't get any redundant URLs piling up.  If htdig doesn't
crawl into the sites themselves, what would you consider to be redundant
URLs?  I don't see how this question is different from the one above.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
