[htdig] digging a web application

Jerry Asher Wed, 20 Jun 2001 02:36:40 -0700
So I am building an htDig solution for OpenACS.  OpenACS is, I suspect, 
typical of many web applications: it contains content you want to index, 
and lots of links to content you don't.

If you examine http://openacs.org/wp/ you can get an idea of what I 
mean.  /wp/ is Wimpypoint, a net-based "powerpoint" equivalent.  Clearly, I 
want to index the presentations found under /wp/, but I don't want to walk 
down all of the navigational links: my presentations, the last two weeks' 
presentations, last month's presentations, etc.  I just want to follow the 
link that shows all of everyone's presentations.

1. One strategy is to use htDig in its typical mode, and use things like 
exclude patterns to keep htDig on track.  One problem with that, which is 
OpenACS-specific (but maybe general to similar web apps), is that the dates 
are wrong.  The webserver returns the current date and time for these 
db-generated presentations, not the date and time the presentation was 
created.
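
To make the exclude-pattern idea concrete, here is a minimal sketch in Python of how substring-style exclusion decides which links get followed.  As I understand it, htDig's exclude_urls attribute rejects any URL that contains one of the listed strings; the patterns and link names below are made up for illustration and are not the real /wp/ URLs.

```python
from urllib.parse import urljoin

# Hypothetical substrings marking navigational views we do not want to crawl.
# These paths are invented for illustration; the real /wp/ URLs may differ.
EXCLUDE_PATTERNS = ["/wp/my-presentations", "/wp/recent", "sort_by="]


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if the URL contains any exclude pattern as a substring."""
    return any(pattern in url for pattern in patterns)


def links_to_follow(base_url, hrefs, patterns=EXCLUDE_PATTERNS):
    """Resolve hrefs against base_url and drop the excluded ones."""
    resolved = (urljoin(base_url, href) for href in hrefs)
    return [url for url in resolved if not is_excluded(url, patterns)]
```

Note that this only keeps the crawler on track; it does nothing about the wrong dates, which come from the server's response headers.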

2. Another strategy would be to create some sort of application-specific, 
database-specific indexing tool.  Presumably this would scour the db 
tools that some application is using and create some sort of htDig-friendly 
results file.  Has anyone done anything like this?  How might this work?
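
One way strategy 2 might look, as a rough sketch only: read the presentations straight from the database and render one static HTML page per row, carrying the creation date in a meta tag.  Everything here is an assumption: the presentations table and its columns are hypothetical, SQLite stands in for the real database, and whether htDig honors a date meta tag depends on the version and its configuration (I believe the use_doc_date attribute governs this, but check your version's documentation).

```python
import html
import sqlite3


def render_page(title, body, created):
    """Build a minimal static HTML page carrying the creation date."""
    return (
        "<html><head>\n"
        f"<title>{html.escape(title)}</title>\n"
        # Assumption: the indexer reads this tag as the document date.
        f'<meta name="date" content="{html.escape(created)}">\n'
        "</head><body>\n"
        f"{html.escape(body)}\n"
        "</body></html>\n"
    )


def export_presentations(conn):
    """Return {file name: HTML} for every row of the hypothetical table."""
    rows = conn.execute(
        "SELECT id, title, body, created FROM presentations ORDER BY id"
    )
    return {
        f"presentation-{row_id}.html": render_page(title, body, created)
        for row_id, title, body, created in rows
    }
```

The resulting pages could be written to a directory that the web server exposes and htDig crawls in its usual way, so only the one listing of exported files needs to be reachable.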

Finally, I wonder if I could combine strategies 1 and 2 by writing a 
converter or external parser.  I imagine I could create a config file 
unique to each web app, and that config file would specify a 
web-app-specific converter to convert from text/html to, well, 
text/html.  Internally the converter would presumably get a page from the 
server and somehow rewrite it so that htDig would get just the right pages, 
and would get the right dates for those pages.  Could this strategy work?
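
A converter along these lines might be sketched as follows.  This is a guess at the shape, not a tested htDig setup: as far as I understand the external_parsers mechanism, the external program is called with the path to a temporary copy of the document, its content type, its URL, and the configuration file, and a converter writes the rewritten document to standard output.  That argument order, the date meta tag, and the lookup_created helper (a placeholder for a real database query) are all assumptions.

```python
import re
import sys


def add_date_meta(page, created):
    """Insert a date meta tag right after the opening <head> tag, if any."""
    tag = f'<meta name="date" content="{created}">'
    match = re.search(r"<head(\s[^>]*)?>", page, flags=re.IGNORECASE)
    if match is None:
        return page  # no <head>: pass the page through unchanged
    return page[: match.end()] + "\n" + tag + page[match.end():]


def lookup_created(url):
    """Hypothetical placeholder: a real converter would query the database."""
    return "2001-06-20"  # fixed illustrative value, not real data


def main(argv):
    # Assumed argument order: input file, content type, URL, config file.
    infile, _content_type, url = argv[1], argv[2], argv[3]
    with open(infile, encoding="utf-8", errors="replace") as handle:
        page = handle.read()
    sys.stdout.write(add_date_meta(page, lookup_created(url)))


if __name__ == "__main__" and len(sys.argv) >= 4:
    main(sys.argv)
```

The same hook could also strip the navigational links from the page before htDig sees them, which would replace the exclude patterns from strategy 1.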

What have folks done to index dynamic content on db driven pages?

Thanks,

Jerry

P.S. If you care to see the search interface I have working, you can take a 
look at it at http://www.theashergroup.com/.  You can see how I am 
experimenting with the OpenACS.ORG indexer by visiting 
http://www.theashergroup.com/demos/openacs.  This is mainly implemented 
with one Tcl script and a rewritten wrapper.html.  I took wrapper.html and 
rewrote it as an AOLserver ADP page (like an ASP page).  The Tcl script 
execs htsearch, and htsearch runs the search using wrapper.adp to create 
the output.  I take the output from htsearch and have AOLserver interpret 
it as an ADP page.  The results are returned to the user.
=====================================================
Jerry Asher                       [EMAIL PROTECTED]
1678 Shattuck Avenue Suite 161    Tel: (510) 549-2980
Berkeley, CA 94709                Fax: (877) 311-8688

