On Fri, 28 Sep 2001, Wolfgang Mueller wrote:
> I did not get any reply regarding my crawling framework question. Are you
> interested in this plugin mechanism, or do you have that already?
I'm sorry, I don't remember seeing anything about a crawling framework. I see
your message now.

When indexing, htdig doesn't visit "all the files", since a variety of
restrictions can be set on whether to visit a particular file (including
robots.txt, META robots tags, and a variety of regex methods in the ht://Dig
configuration...). Furthermore, at the moment ht://Dig doesn't attempt to
index images. It keeps a list of the URLs, but doesn't do much with them
since it's a text indexing package.

> I was thinking of shared libs that can be loaded on startup. The shared lib
> to be used could be an option of wget. The GIFT would wrap this up in a small
> shell script, making ugly things invisible to the user.
...
> 1) if someone of you htdig/wget guys is doing that already
> 2) if you are interested in me adding something like that to wget/htdig
>    or alternatively, if somebody volunteers...
> 3) if someone would be willing or able to point me to the right places,
> 4) how to do things in order to maximize the use for everybody.

The new 3.2 development code for ht://Dig offers the ability to run transport
protocols through a shell script. So you could certainly add a variety of
hooks this way before passing the document back to htdig to finish the
indexing.

For quite some time, ht://Dig has also had a system of external parser
programs (and now "external converters") that are called to parse or
translate the document. Again, I could see where you could add hooks through
shell scripts here (a small example follows below).

> It would be much more practical, if we have some program which gets each
> document, indexes it, and deletes the local copy of it, then gets the next
> image etc.

Unless you're calling an external parser or external converter, documents are
never written out to disk by ht://Dig. There have been a few requests to
build up local caching, but right now it fetches documents, keeps them in
memory as it parses/indexes them, and moves along.

I don't know if this directly answers your questions, but it sounds like you
can do what you want without needing much in the way of code changes.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
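To make the external parser/converter hook concrete, here is a minimal
sketch. It assumes the external_parsers attribute in htdig.conf, a
hypothetical wrapper script at /usr/local/bin/pdf2html.sh, and that
pdftotext is installed; check the external parser documentation for your
version for the exact arguments ht://Dig passes to the script.

  # htdig.conf: hand application/pdf documents to a converter script that
  # rewrites them as text/html before htdig indexes the result.
  external_parsers: application/pdf->text/html /usr/local/bin/pdf2html.sh

  #!/bin/sh
  # /usr/local/bin/pdf2html.sh (hypothetical path)
  # ht://Dig calls the script with the fetched document in a temporary file;
  # here we assume that file is the first argument. A converter only needs
  # to write the translated document to stdout.
  infile="$1"
  # -htmlmeta wraps the extracted text in a minimal HTML document.
  exec pdftotext -htmlmeta "$infile" -

The same wrapper-script idea applies to the 3.2 external transport hooks:
any preprocessing (or handing the URL off to another fetcher) can happen in
the script before the document is returned to htdig for indexing.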
