On Sep 21, 2004, at 6:16 PM, Jim wrote:

On Tue, 21 Sep 2004, Aaron wrote:

How long should the initial database build take (ballpark)? The machine is a dual G4 1.2 GHz, and we are talking about over 500,000 documents. Is this an hour? Two? Twelve? Days? After I sent the email last night, I started a fresh rundig and included all of the documents. It's still running htdig and hasn't even gotten to htnotify, so it's been running for about 8 hours. Is this to be expected?

I don't find it surprising for that many documents. However, providing even a ballpark figure for the total indexing time is difficult due to the number of factors involved. Document size, network performance, and htdig configuration settings can all have a major impact on the time required. The amount of RAM can also make a huge difference if there is not enough to avoid swapping.

The server has 2 GB of RAM, and the documents are small. They are all e-mails, so I would guess mostly under 100 KB.


In the future you might try supplying the -v flag, which will give you a little
feedback regarding progress. The only other thing I can think of to
suggest is that you try running against some smaller, representative
samples of the full document collection. You might be able to extrapolate
something useful from that.

I passed the -vvv flags and ran rundig against the first 100 documents, and everything finished in roughly 1 minute. Based on this, I am guessing 500,000 documents / 100 documents per minute = 5,000 minutes, which is about 83 hours. So this will take roughly three and a half days to index? If that is the case, does that mean each time I update the index via cron it will take this long, or will it be differential?
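
For what it's worth, here is that extrapolation as a small Python sketch. It assumes indexing time scales linearly with document count, which may well not hold: fixed startup cost in a 100-document run would inflate the estimate, and swapping on the full run would blow past it. The function name is just something I made up for illustration.

```python
def estimate_index_time(total_docs, sample_docs, sample_minutes):
    """Linearly extrapolate indexing time from a small sample run.

    Assumes every document costs about the same as the sample average.
    Returns the estimate as (minutes, hours, days).
    """
    docs_per_minute = sample_docs / sample_minutes
    minutes = total_docs / docs_per_minute
    return minutes, minutes / 60, minutes / (60 * 24)


# Figures from the test run above: 100 documents in roughly 1 minute.
minutes, hours, days = estimate_index_time(500_000, 100, 1.0)
print(f"{minutes:.0f} minutes = {hours:.1f} hours = {days:.2f} days")
# prints: 5000 minutes = 83.3 hours = 3.47 days
```

A larger sample (say a few thousand documents) would probably give a more trustworthy rate than 100.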


Also, with the full 500,000 documents and the -vvv flags, I haven't seen any progress after 10 hours.

Also be aware that there is a 2 GB limit on the size of some of the files
involved with index creation. Exceeding that limit will kill your indexing
run.

How can I guess how big the files will get?
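
One rough way I could imagine guessing, sketched in Python: index a small sample, find the largest file in the database directory, and scale it up by document count. This assumes file size grows linearly with the number of documents, which is only a guess, and it assumes the 2 GB limit is the usual signed 32-bit offset ceiling. The function names and the sample size in the example are hypothetical, not measurements.

```python
import os

# Assumption: "2 GB" means the signed 32-bit file offset ceiling.
LIMIT_BYTES = 2**31 - 1


def largest_file(db_dir):
    """Return (name, size_in_bytes) of the largest regular file in db_dir."""
    files = [(e.name, e.stat().st_size)
             for e in os.scandir(db_dir) if e.is_file()]
    return max(files, key=lambda item: item[1])


def projected_size(sample_bytes, sample_docs, total_docs):
    """Scale a file size measured on a sample up to the full collection."""
    return sample_bytes * total_docs / sample_docs


# Hypothetical figure: suppose the largest file after a 100-document
# run were 1 MiB. Replace it with largest_file(<your database dir>)[1].
estimate = projected_size(1_048_576, 100, 500_000)
print(f"projected: {estimate / 2**30:.2f} GiB, "
      f"over limit: {estimate > LIMIT_BYTES}")
```

Running the sample at two or three different sizes would show whether the growth really is close to linear.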

Thanks again for the help!

-=Aaron



_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
