According to David T. Ashley:
> What kind of site are you indexing?
> 
> How good is the network connection between the machine running the
> search engine and the machine hosting the site?
> 
> Are the server and search engine on the same machine?
> 
> What OS?

Jose Juilan Buda wrote:
> Has anyone indexed more than 45,000 files with htdig?
> I have a Pentium III 933 with 256 MB RAM and a 30 GB IDE disk,
> and when I run "rundig" to create the database, it takes almost
> 10 hours to build the complete database from scratch.
> I increased the Apache "timeout" parameter to... well... something
> very high, and this time it worked, but it took a long time.
> Is that normal?
> I hope that from now on, just running htdig and htmerge will do
> the update without taking so much time.

And earlier...
> I am having some problems with the digging process. I have now set
> the "timeout" parameter to 10000, because I think that is where the
> problem lies: when I run "htdig -vvv" I see the program lock up
> waiting at
> 
> "Retrieval command for http://myserver/mydirectory_to_index/:
> GET /mydirectory_to_index/ HTTP/1.0..."
> 
> and then it says
> 
> "Header Line:
> returnStatus = 1
>  not found
> pick: myserver, # servers = 1"
> 
> but this directory does exist on my web server.
> 
> So, is it an Apache configuration problem?

Yes, these messages are consistent with htdig timing out while waiting
for an HTTP header from the server.  What isn't clear to me from your
messages is whether increasing the timeout to something large makes it
work correctly, or whether it's still failing.

If it's working correctly, I don't see what the problem is.  Certainly,
indexing 45,000 documents over HTTP is going to take quite a while, so
10 hours doesn't seem unreasonable.  You may be able to avoid the hangs
and timeouts by setting server_wait_time to something like 1 or 2, but
then it may take longer still to index the site, because of the pause
between documents fetched.
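
For example, in your htdig.conf you might try something like the
following.  The attribute names (timeout, server_wait_time) come
straight from the htdig documentation, but the values here are only a
guessed starting point; tune them for your server:

    # Seconds to wait on a network read before giving up on a document.
    timeout: 300

    # Seconds to pause between successive requests to the same server,
    # so a slow server isn't overwhelmed.
    server_wait_time: 2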

Once you've got a complete database of all your documents, updating it
with htdig (without the -i option) and htmerge should be much quicker,
as htdig can quickly check which documents are unchanged, and it won't
fetch or reparse them.
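
So once the initial dig has succeeded, an update run would look
something like this (the config file path is an assumption; substitute
wherever yours is installed):

    # Initial dig only: -i discards the old database and starts fresh.
    htdig -i -c /etc/htdig/htdig.conf

    # Update digs: omit -i, so unchanged documents are skipped instead
    # of being refetched and reparsed.
    htdig -c /etc/htdig/htdig.conf
    htmerge -c /etc/htdig/htdig.conf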

David's question about whether the web server and htdig are on the same
machine is quite significant.  If they are, you can take advantage of
the local_urls feature to speed up indexing by bypassing the HTTP server
and fetching files right from the local filesystem.  This will be an
added benefit for update digs too, because checking for updated documents
will be very, very quick.
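
For example, assuming your Apache DocumentRoot is /var/www/html
(substitute your own), something like this in htdig.conf should do it:

    # Map URLs under http://myserver/ directly to the local filesystem,
    # bypassing the HTTP server; the URLs themselves are still what
    # gets stored in the database.
    local_urls: http://myserver/=/var/www/html/

    # File to read when a local URL ends in a directory.
    local_default_doc: index.html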

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
