Thanks for the reply.  I am now digging thru the htdig site ;) again.

If anyone knows where I can find the correct format for the headers it
would be much appreciated.

-Rylan

-----Original Message-----
From: Matthew Nuzum [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, July 02, 2002 10:48 AM
To: 'Rylan W. Hazelton'; [EMAIL PROTECTED]
Subject: RE: [htdig] htdig 3.2 LARGE site


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Rylan W.
Hazelton
Sent: Tuesday, July 02, 2002 1:12 PM
To: [EMAIL PROTECTED]
Subject: [htdig] htdig 3.2 LARGE site

I have written a set of scripts for htdig to read through all of the
posts in my Vbulletin forum.  It allows htdig to view every post on its
own page.  I then rewrite the urls so that the user sees them in the
"pretty" form when they do a search.

My problem is that my forum has almost 1M posts, which means that
htdig has almost 1M pages to index.

I let it run for about 8hrs and it only dug about 20% of them.  I need
to find a way to make the indexing more palatable to the server and was
hoping someone can help me here.

Options I have considered:

1) Run a big dig (all 1M posts) then, run nightly digs of the posts in
the last 24-36 hours, then merge the dbs.

2) Break the posts up into blocks of ~50-100k pages and index them all
separately, then merge the dbs.
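
To make option 2 concrete, here is a minimal sketch of how the posts could
be split into blocks.  It assumes post IDs are plain integers running from
a first to a last ID; the function name id_blocks and the ID range are
hypothetical, and how each block is then fed to htdig (for example one
list of start URLs per block) is left open.

```python
def id_blocks(first_id, last_id, block_size):
    """Yield inclusive (start, end) post-ID ranges of at most block_size IDs."""
    for start in range(first_id, last_id + 1, block_size):
        # The final block may be shorter than block_size.
        yield start, min(start + block_size - 1, last_id)


if __name__ == "__main__":
    # Hypothetical forum with IDs 1..1,000,000 split into 100k blocks.
    for start, end in id_blocks(1, 1_000_000, 100_000):
        print(f"block: posts {start} to {end}")
```

Each block can then be dug on its own schedule, so no single run has to
cover the whole forum.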


How do you guys update your dbs?  Do I need to reindex them all every
time?

Please help.

Also how can I search multiple dbs at once in 3.2?  Are there any docs
for 3.2?

Thanks

-Rylan


I have encountered a similar problem.  I now index only 3 times per week
because of it.  If I try to index more often the new one starts before
the old one finishes and it corrupts the database.

I can't propose a perfect solution to your problem, but I can tell you
what I've found in my troubleshooting.

A big problem is that pages created dynamically with PHP or other CGI
scripts usually don't send the header ht://dig needs to detect whether
the page has changed.  This means that the search engine does a complete
re-index every time.

Since it sounds like you've done some custom coding here, you may want
to consider adding this header to your output.  For example, if the
database keeps track of when the posts were generated, you could
just format that date and time properly and send it in the appropriate
header.  If the data hasn't changed htdig can skip that page.

Your first index will still take a long time, but subsequent indexes
will go faster.

I don't remember the exact format of the header you need, but it's well
documented and I'm sure someone on the list can give a pointer.
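
As a rough sketch only, assuming the header meant here is the standard
HTTP Last-Modified header (an HTTP-date in GMT, e.g.
"Tue, 02 Jul 2002 17:12:00 GMT") and that the crawler may send
If-Modified-Since on later runs, the formatting and comparison could look
like this.  The function names are hypothetical, and the forum scripts
themselves would do the equivalent in their own language.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def last_modified_header(post_time: datetime) -> str:
    """Format a timezone-aware post timestamp as an HTTP-date string."""
    # HTTP dates are always expressed in GMT, so convert to UTC first.
    return format_datetime(post_time.astimezone(timezone.utc), usegmt=True)


def is_unchanged(post_time: datetime, if_modified_since: str) -> bool:
    """Return True if the post has not changed since the date the client sent."""
    since = parsedate_to_datetime(if_modified_since)
    # HTTP dates have one-second resolution, so drop microseconds.
    return post_time.replace(microsecond=0) <= since


if __name__ == "__main__":
    posted = datetime(2002, 7, 2, 17, 12, 0, tzinfo=timezone.utc)
    print("Last-Modified:", last_modified_header(posted))
```

If is_unchanged() returns True, the script could answer with a 304 Not
Modified status instead of regenerating the page.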

Matthew Nuzum
[EMAIL PROTECTED]
www.bearfruit.org






