Thanks for the reply. I am now digging thru the htdig site ;) again. If anyone knows where I can find the correct format for the headers it would be much appreciated.
-Rylan

-----Original Message-----
From: Matthew Nuzum [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, July 02, 2002 10:48 AM
To: 'Rylan W. Hazelton'; [EMAIL PROTECTED]
Subject: RE: [htdig] htdig 3.2 LARGE site

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Rylan W. Hazelton
Sent: Tuesday, July 02, 2002 1:12 PM
To: [EMAIL PROTECTED]
Subject: [htdig] htdig 3.2 LARGE site

I have written a set of scripts for htdig to read through all of the posts in my vBulletin forum. It allows htdig to view every post on its own page. I then rewrite the URLs so that the user sees them in the "pretty" form when they do a search.

My problem is that my forum has almost 1M posts, which means 1M pages that htdig has to index. I let it run for about 8 hours and it only dug about 20% of them. I need to find a way to make the indexing more palatable to the server and was hoping someone can help me here.

Options I have considered:
1) Run a big dig (all 1M posts), then run nightly digs of the posts in the last 24-36 hours, then merge the dbs.
2) Break the posts up into blocks of ~50-100k pages and index them all separately, then merge the dbs.

How do you guys update your dbs? Do I need to reindex them all every time? Please help.

Also, how can I search multiple dbs at once in 3.2? Are there any docs for 3.2?

Thanks
-Rylan

I have encountered a similar problem. I now index only 3 times per week because of it. If I try to index more often, the new dig starts before the old one finishes and it corrupts the database.

I can't propose a perfect solution to your problem, but I can tell you what I've found in my troubleshooting. A big problem is that pages created dynamically with PHP or other CGI usually don't send the header ht://dig needs to detect whether the page has changed. This means that the search engine does a complete re-index every time.
Since it sounds like you've done some custom coding here, you may want to consider adding this header to your output. For example, if the database keeps track of what times the posts were generated, you could just format that date and time properly and send it in the appropriate header. If the data hasn't changed, htdig can skip that page. Your first index will still take a long time, but subsequent indexes will go faster.

I don't remember the exact format of the header you need, but it's well documented and I'm sure someone on the list can give a pointer.

Matthew Nuzum
[EMAIL PROTECTED]
www.bearfruit.org

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

