Hey all,
        
I mentioned this briefly in a reply to Geoff... here it is fleshed out for
pondering.

        http://www.httrack.com/

I did a little more investigation into httrack..

httrack is a neat and small web-site copier/spider.  It builds out of the
box as a library (libhttrack.so).  There is a supplied example app that
uses libhttrack for basic web-site copying; it is about 150 lines long.
The main httrack exe adds a few more lines for various features.

So a short path to using httrack to get a feature like this would be to:

Httrack path:

in htdig
1. substitute the call to the htdig internal retriever/transport calls
2. stream original URLs, local http-URLs & local-disk filenames to a logfile
3. add 'URL CacheURL' to the Document class and mifluz/DB 'schema'
4. fire up the retriever to process this logfile, parsing & indexing the
files from the local disk and adding the original and cached URLs to the
Document object.

in htsearch
1. display the cached URL, given a config option.
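The hand-off logfile in step 2 above might be parsed along these lines.  To be clear, this is a sketch: the record format (one whitespace-separated "original-URL cache-URL local-path" triple per line) is my assumption, not anything httrack or htdig emits today.

```python
# Hypothetical parser for the retriever hand-off log described above.
# Assumed format: one "original-URL cache-URL local-path" record per line.
from typing import Iterator, NamedTuple


class MirrorRecord(NamedTuple):
    original_url: str   # URL the page was originally fetched from
    cache_url: str      # local http-URL serving the mirrored copy
    local_path: str     # on-disk filename to parse & index


def read_mirror_log(path: str) -> Iterator[MirrorRecord]:
    """Yield one MirrorRecord per well-formed, non-empty log line."""
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) == 3:
                yield MirrorRecord(*fields)
```

The three fields line up with the proposal: the original URL and cache URL go into the extended Document object, and the local path is what the retriever parses instead of hitting the network.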


Alternate path:

1. swipe httrack code for creating the directory structures on the
fly for storing the spidered web-sites
2. write retrieved documents to local files in Retriever (swipe httrack
code for localizing necessary page components?)
3. same changes to Document object.
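For step 1 of the alternate path, the directory layout could look something like this.  This is only a rough sketch in the spirit of httrack's host/path mirror tree; the exact rules (defaulting directories to index.html, ignoring query strings, etc.) are my assumptions, not httrack's actual algorithm.

```python
# Hypothetical URL -> local mirror path mapping, roughly mimicking
# httrack's host/path directory layout.  Naming rules are assumptions.
import posixpath
from urllib.parse import urlparse


def url_to_local_path(url: str, root: str = "mirror") -> str:
    """Map a retrieved URL to an on-disk filename under root/host/path."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Assume directory URLs are stored as index.html, like most mirrors.
        path += "index.html"
    return posixpath.join(root, parts.netloc, path.lstrip("/"))
```

The Retriever would write each fetched document to url_to_local_path(url) and record that filename (plus the cache URL) for the Document object.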


One real long-term advantage to using httrack or something similar would
be to offload the forward maintenance of some of the htdig transport
code to another project, leaving htdig developers more time to work on
other features.  Of course this is an oversimplification; choosing a
quickly changing and complicated site-copier could prove to be painful.


As it is now, I think that whipping up a version of htdig that processes a
log like the one described above (with the additions to the Document
class & DB schema) would be pretty easy.  Users could run htdig with a
command-line switch after running httrack.

I am not 100% clear that httrack does a great job localizing web-page
components... i.e. when a silly web author uses absolute http links to
everything.  It also handles http only.

There may be other spiders better than httrack... I used it a while back
to suck down a bunch of AI FAQs, and it worked flawlessly.  It is well
reviewed by users, for whatever that is worth.


-- 
Neal Richter 
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site



_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
