On Fri, 28 Sep 2001, Wolfgang Mueller wrote:
> I did not get any reply regarding my crawling framework question. Are you
> interested in this plugin mechanism, or do you have that already?
I'm sorry, I don't remember seeing anything about a crawling framework. I see
your message now.

When indexing, htdig doesn't visit "all the files", since a variety of
restrictions can be set on whether to visit a particular file (including
robots.txt, META robots tags, and a variety of regex methods in the ht://Dig
configuration...). Furthermore, at the moment ht://Dig doesn't attempt to
index images. It keeps a list of the URLs, but doesn't do much with them
since it's a text indexing package.

> I was thinking of shared libs that can be loaded on startup. The shared lib
> to be used could be an option of wget. The GIFT would wrap this up in a small
> shell script, making ugly things invisible to the user.
...
> 1) if someone of you htdig/wget guys is doing that already
> 2) if you are interested in me adding something like that to wget/htdig
>    or alternatively, if somebody volunteers...
> 3) if someone would be willing or able to point me to the right places,
> 4) how to do things in order to maximize the use for everybody.

The new 3.2 development code for ht://Dig offers the ability to run transport
protocols through a shell script. So you could certainly add a variety of
hooks this way before passing the document back to htdig to finish the
indexing.

For quite some time, ht://Dig has also had a system of external parser
programs (and now "external converters") that are called to parse or
translate the document. Again, I could see where you could add hooks through
shell scripts here (a small example follows below).

> It would be much more practical, if we have some program which gets each
> document, indexes it, and deletes the local copy of it, then gets the next
> image etc.

Unless you're calling an external parser or external converter, documents are
never written out to disk by ht://Dig. There have been a few requests to
build up local caching, but right now it fetches documents, keeps them in
memory as it parses/indexes them, and moves along.

I don't know if this directly answers your questions, but it sounds like you
can do what you want without needing much in the way of code changes.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
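To make the external parser/converter hook concrete, here is a minimal
sketch. It assumes the external_parsers attribute in htdig.conf, a
hypothetical wrapper script at /usr/local/bin/pdf2html.sh, and that
pdftotext is installed; check the external parser documentation for your
version for the exact arguments ht://Dig passes to the script.

  # htdig.conf: hand application/pdf documents to a converter script that
  # rewrites them as text/html before htdig indexes the result.
  external_parsers: application/pdf->text/html /usr/local/bin/pdf2html.sh

  #!/bin/sh
  # /usr/local/bin/pdf2html.sh (hypothetical path)
  # ht://Dig calls the script with the fetched document in a temporary file;
  # here we assume that file is the first argument. A converter only needs
  # to write the translated document to stdout.
  infile="$1"
  # -htmlmeta wraps the extracted text in a minimal HTML document.
  exec pdftotext -htmlmeta "$infile" -

The same wrapper-script idea applies to the 3.2 external transport hooks:
any preprocessing (or handing the URL off to another fetcher) can happen in
the script before the document is returned to htdig for indexing.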
