According to David T. Ashley:
> What kind of site are you indexing?
>
> How is the communication between the machine with the search engine and the
> machine hosting the site?
>
> Are the server and search engine on the same machine?
>
> What OS?
Jose Juilan Buda wrote:
> Did someone index more than 45,000 files with htdig?
> I have a Pentium III 933, 256 MB RAM, 30 GB IDE, and
> when I run "rundig" to create the database, it takes
> almost 10 hours to build the complete database from
> the beginning.
> I increased the "timeout" Apache parameter
> to... well... very high, and this time it worked, but
> it takes a long time.
> Is that correct?
> I hope that from now on, just running htdig and htmerge
> will do the update and not take much time.

And earlier...

> Because I do have some problems with the digging
> process. I have now set the "timeout" parameter to 10000,
> because I think that is the problem: when I run
> "htdig -vvv" I see that the program locks up waiting at
>
>   "Retrieval command for
>   http://myserver/mydirectory_to_index/ GET
>   "/mydirectory_to_index/ HTTP/1.0...
>
> and then it says:
>
>   "Header Line:
>   returnStatus = 1
>   not found
>   pick: myserver, # servers = 1"
>
> but this directory on my webserver does exist.
>
> So? Is it an Apache configuration problem?

Yes, these messages are consistent with htdig timing out while waiting for an HTTP header from the server. What isn't clear to me from your messages is whether increasing the timeout to something large makes it work correctly, or whether it's still failing. If it's working correctly, I don't see what the problem is. Certainly, indexing 45,000 documents over HTTP is going to take quite a while, so 10 hours doesn't seem unreasonable.

You may be able to avoid the hangs and timeouts by setting server_wait_time to something like 1 or 2, but then it may take longer still to index the site, because of the pause between fetched documents.

Once you've got a complete database of all your documents, updating it with htdig (without the -i option) and htmerge should be quicker, as it can quickly check which documents are unchanged, and it won't fetch or reparse those.
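For reference, the attributes discussed above are set in htdig's configuration file, and the initial versus update digs differ only in the -i flag. A hedged sketch follows; the attribute names are real htdig 3.x settings, but the values, hostname, and config path are illustrative assumptions, not taken from this thread:

```
# --- htdig.conf excerpt (values illustrative) ---
timeout:          300   # seconds htdig waits for an HTTP response (default is 30)
server_wait_time: 2     # seconds to pause between document fetches

# --- shell commands: initial vs. update dig ---
htdig -i -c /etc/htdig/htdig.conf     # -i: initial dig, fetches every document (slow)
htmerge -c /etc/htdig/htdig.conf
# For later updates, omit -i so unchanged documents are not refetched or reparsed:
htdig -c /etc/htdig/htdig.conf
htmerge -c /etc/htdig/htdig.conf
```

The rundig script bundled with htdig essentially wraps the initial-dig sequence; running htdig and htmerge directly gives you control over the -i flag for updates.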
David's question about whether the web server and htdig are on the same machine is quite significant. If they are, you can take advantage of the local_urls feature to speed up indexing by bypassing the HTTP server and fetching files right from the local filesystem. This will be an added benefit for update digs too, because checking for updated documents will be very, very quick.

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
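[A sketch of the local_urls setup Gilles describes above. The attribute names are real htdig settings; the hostname and the DocumentRoot /var/www/html are assumptions, not details from this thread:]

```
# htdig.conf excerpt -- map URL prefixes to filesystem paths so htdig
# reads documents directly from disk instead of going through Apache:
local_urls:      http://myserver/=/var/www/html/

# Optional companion for ~user URLs: maps http://myserver/~user/...
# to /home/<user>/public_html/...
local_user_urls: http://myserver/=/home/,/public_html/
```

If htdig cannot open a document locally (e.g. a CGI-generated page), it should fall back to fetching it over HTTP, so the mapping only needs to cover static files.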

