Hey all, I mentioned this briefly in a reply to Geoff... here it is fleshed out for pondering.
http://www.httrack.com/

I did a little more investigation into httrack. It is a neat, small web-site copier/spider. It builds out of the box as a library (libhttrack.so), and there is a supplied example app, about 150 lines long, that uses libhttrack for basic web-site copying. The main httrack executable has a few more lines for various features.

So a short path to using httrack to get a feature like this would be:

Httrack path:

in htdig
1. Replace the calls to htdig's internal retriever/transport code with calls to httrack.
2. Stream the original URL, local http URL and local-disk filename of each document to a logfile.
3. Add 'URL CacheURL' to the Document class and the mifluz/DB 'schema'.
4. Fire up the retriever to process this logfile, parsing and indexing the files from the local disk and adding the original and cached URLs to each Document object.

in htsearch
1. Display the cached URL on screen, given a config option.

Alternate path:
1. Swipe the httrack code that creates, on the fly, the directory structures for storing the spidered web-sites.
2. Write retrieved documents to local files in Retriever (swipe the httrack code for localizing necessary page components?).
3. Make the same changes to the Document object.

One real long-term advantage of using httrack or something similar would be to offload the forward maintenance of some of the htdig transport code to another project, leaving htdig developers more time to work on other features. Of course this is an oversimplification; choosing a quickly changing and complicated site-copier could prove to be painful.

As it is now, I think that whipping up a version of htdig that processes a log like the one described above (with the additions to the Document class & DB schema) would be pretty easy. Users could run htdig with a command-line switch after running httrack.

I am not 100% clear that httrack does a great job of localizing web-page components, i.e. when a silly web author uses full http links to everything. It also handles web/http only. There may be other spiders better than httrack...
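To make the logfile idea a bit more concrete, here is a rough sketch of what reading such a log might look like on the htdig side. Everything in it is an assumption for illustration: the tab-separated layout (original URL, cache URL, local path), the CacheLogEntry struct and the function names are hypothetical and are not taken from htdig or httrack. It needs C++17 for std::optional.

```cpp
#include <istream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one line of the proposed logfile.
struct CacheLogEntry {
    std::string original_url;  // URL as fetched from the live site
    std::string cache_url;     // local http URL that serves the copy
    std::string local_path;    // file on local disk to parse and index
};

// Parse one tab-separated line: original URL, cache URL, local path.
// Returns std::nullopt for blank lines, '#' comments and malformed lines.
std::optional<CacheLogEntry> parse_cache_log_line(const std::string& line) {
    if (line.empty() || line[0] == '#') return std::nullopt;

    std::istringstream fields(line);
    CacheLogEntry entry;
    if (!std::getline(fields, entry.original_url, '\t')) return std::nullopt;
    if (!std::getline(fields, entry.cache_url, '\t')) return std::nullopt;
    if (!std::getline(fields, entry.local_path)) return std::nullopt;

    // Reject lines where any of the three fields is empty.
    if (entry.original_url.empty() || entry.cache_url.empty() ||
        entry.local_path.empty()) {
        return std::nullopt;
    }
    return entry;
}

// Read a whole log, silently skipping lines that do not parse.
std::vector<CacheLogEntry> read_cache_log(std::istream& in) {
    std::vector<CacheLogEntry> entries;
    std::string line;
    while (std::getline(in, line)) {
        if (auto entry = parse_cache_log_line(line)) {
            entries.push_back(*entry);
        }
    }
    return entries;
}
```

Each entry would then be handed to the existing parsing/indexing code using local_path as the source, with original_url and cache_url stored on the Document object. A real version would probably want to report malformed lines rather than skip them.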
I used httrack a while back to suck down a bunch of AI FAQs, and it worked flawlessly. It is well reviewed by users, for whatever that is worth.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site