On Thu, 5 Jul 2001, Gilles Detillieux wrote:
> Date: Thu, 5 Jul 2001 12:29:21 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig] fine tuning search to allow rejected urls
>
> According to Steven C. Williams:
> > We point htdig at a specific directory on my server, in this
> > case:
> > http://64.130.230.53/syndicate
> > There are a bunch of php includes which we create from feeds
> > from different sites around the web; as far as what htdig does
> > with those, everything is great.
> >
> > Now here's where we want to go:
> > We're trying to get htdig to make some record of the title
> > words and URLs of the articles that are coming through this
> > feed. Right now, with the current configuration in htdig.conf,
> > we get links only to the syndicated php files on our server.
> > When we run rundig -vvv, we get the following output (excerpt
> > for one site):
> [snip]
> > What we'd like to do is have these rejected URLs make it
> > into the htdig database without htdig crawling into the
> > sites themselves. Is that possible?
>
> Maybe, but it might be tricky. If these external URLs are always
> a fixed number of hops away from the start_url, it's pretty easy.
> Just add the names of the other hosts to limit_urls_to, or leave it
> wide open with a pattern like "limit_urls_to: http://", and set your
> hop_count to prevent htdig from spidering down too deep. If you're
> running htdig-3.1.5, you'll probably need to install this patch to make
> sure hop counts aren't corrupted, unless you never encounter a link to
> a given external URL more than once:
>
> ftp://ftp.htdig.org/htdig-patches/3.1.5/hop_count.0
^^^^^
I think you meant:
ftp://ftp.ccsf.org/htdig-patches/3.1.5/hop_count.0
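For what it's worth, the limit_urls_to / hop count approach Gilles
describes above would look roughly like this in htdig.conf. This is
only a sketch: the attribute for the hop limit is max_hop_count (at
least in 3.1.x), and the value of 2 is a guess -- pick whatever depth
puts the external article links just within reach without letting
htdig spider into those sites:

    # start_url taken from the original message
    start_url:      http://64.130.230.53/syndicate
    # leave the crawl open to any host, as suggested above
    limit_urls_to:  http://
    # stop following links this many hops from the start_url (a guess)
    max_hop_count:  2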
Regards,
Joe
> If the number of hops to these external URLs is not consistent, I
> can't think of an easy way, unless you can somehow break it down into
> chunks that are consistent, indexing them separately and then merging
> them together.
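A rough outline of that separate-index-and-merge idea, assuming one
config file per "chunk" whose external links sit at a consistent
depth (chunk1.conf and chunk2.conf are made-up names here, and the
final merge step depends on your htmerge supporting the -m option --
check its man page for your release):

    # index each chunk into its own database_dir, with its own hop limit
    htdig -i -c chunk1.conf
    htmerge -c chunk1.conf
    htdig -i -c chunk2.conf
    htmerge -c chunk2.conf
    # then fold chunk2's databases into chunk1's
    htmerge -c chunk1.conf -m chunk2.conf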
>
> > Furthermore, would it be possible, then, to accumulate
> > that information without the redundant URLs piling up
> > in the database?
>
> I'm not sure I follow you here. If you can keep the spidering in check,
> then you won't get any redundant URLs piling up. If htdig doesn't
> crawl into the sites themselves, what would you consider to be redundant
> URLs? I don't see how this question is different from the one above.
>
> --
> Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930