On Thu, 5 Jul 2001, Gilles Detillieux wrote:
> Date: Thu, 5 Jul 2001 12:29:21 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig] fine tuning search to allow rejected urls
>
> According to Steven C. Williams:
> > We point htdig at a specific directory on my server, in this
> > case:
> > http://64.130.230.53/syndicate
> > There are a bunch of php includes which we create from feeds
> > from different sites around the web; as far as what htdig does
> > with those, everything is great.
> >
> > Now here's where we want to go:
> > We're trying to get htdig to make some record of the title
> > words and URLs of the articles that are coming through this
> > feed. Right now, with the current configuration in htdig.conf,
> > we get links only to the syndicated php files on our server.
> > When we run rundig -vvv, we get the following output (excerpt
> > for one site):
> [snip]
> > What we'd like to do is have these rejected URLs make it
> > into the htdig database without htdig crawling into the
> > sites themselves. Is that possible?
>
> Maybe, but it might be tricky. If these external URLs are always
> a fixed number of hops away from the start_url, it's pretty easy.
> Just add the names of the other hosts to limit_urls_to, or leave it
> wide open with a pattern like "limit_urls_to: http://", and set your
> hop_count to prevent htdig from spidering down too deep. If you're
> running htdig-3.1.5, you'll probably need to install this patch to make
> sure hop counts aren't corrupted, unless you never encounter a link to
> a given external URL more than once:
>
> ftp://ftp.htdig.org/htdig-patches/3.1.5/hop_count.0
^^^^^
I think you meant:
ftp://ftp.ccsf.org/htdig-patches/3.1.5/hop_count.0
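For what it's worth, the limit_urls_to / hop count approach Gilles
describes above would look roughly like this in htdig.conf. This is
only a sketch: the attribute for the hop limit is max_hop_count (at
least in 3.1.x), and the value of 2 is a guess -- pick whatever depth
puts the external article links just within reach without letting
htdig spider into those sites:

    # start_url taken from the original message
    start_url:      http://64.130.230.53/syndicate
    # leave the crawl open to any host, as suggested above
    limit_urls_to:  http://
    # stop following links this many hops from the start_url (a guess)
    max_hop_count:  2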
Regards,
Joe
> If the number of hops to these external URLs is not consistent, I
> can't think of an easy way, unless you can somehow break it down into
> chunks that are consistent, indexing them separately and then merging
> them together.
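A rough outline of that separate-index-and-merge idea, assuming one
config file per "chunk" whose external links sit at a consistent
depth (chunk1.conf and chunk2.conf are made-up names here, and the
final merge step depends on your htmerge supporting the -m option --
check its man page for your release):

    # index each chunk into its own database_dir, with its own hop limit
    htdig -i -c chunk1.conf
    htmerge -c chunk1.conf
    htdig -i -c chunk2.conf
    htmerge -c chunk2.conf
    # then fold chunk2's databases into chunk1's
    htmerge -c chunk1.conf -m chunk2.conf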
>
> > Furthermore, would it be possible, then, to accumulate
> > that information without the redundant URLs piling up
> > in the database?
>
> I'm not sure I follow you here. If you can keep the spidering in check,
> then you won't get any redundant URLs piling up. If htdig doesn't
> crawl into the sites themselves, what would you consider to be redundant
> URLs? I don't see how this question is different from the one above.
>
> --
> Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930