Thank you for the suggestion about the .work files. I'll confirm that my
Apache 1.3.1 supports the If-Modified-Since header and is configured
correctly in that respect.
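One way I could check this, as a rough sketch: send a conditional GET with an If-Modified-Since date and see whether the server answers 304 Not Modified. The helper names and any URL passed in are placeholders of mine, not anything taken from htdig itself.

```python
import email.utils
import urllib.error
import urllib.request


def conditional_request(url, since_epoch):
    """Build a GET request carrying an If-Modified-Since header."""
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    http_date = email.utils.formatdate(since_epoch, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": http_date})


def is_unchanged(url, since_epoch):
    """Return True if the server reports the page unchanged since since_epoch."""
    try:
        with urllib.request.urlopen(conditional_request(url, since_epoch)) as resp:
            # A normal 200 response means the page was sent again in full.
            return resp.status == 304
    except urllib.error.HTTPError as err:
        # urllib reports non-2xx statuses, including 304, as HTTPError.
        return err.code == 304
```

Calling is_unchanged() on a page that has not been touched, with a timestamp from after its last modification, should return True if the server honors the header. If it returns False for such a page, every page is being re-sent on every run.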
>But I've got a DB of about 50,000, so only indexing daily changes (a
>few hundred pages maybe) is a big speedup.
I've got about 130,000 pages, split very unevenly over about 180 htdig
databases. I'm growing by about 1,000 pages per day, and may grow
much faster soon. I didn't realize that removing the .work files was
killing incremental indexing - I'll check how much keeping them helps
during a full-scale test.
> [...] but benchmarks will vary considerably, and O(n) isn't useful.
I must strongly disagree. O(n) estimates would be quite helpful. I'd use
them when planning hardware capacity and making time vs. space tradeoffs
in configuration. They affect decisions such as:
* whether to leave .work files around.
* whether to use -a, -i
* judging how close I am to resource limits
* deciding whether to buy more storage or more CPU power or more bandwidth
* deciding whether to manually tell htdig about new pages
  or let it look for them itself.
These decisions would be much easier if I could benchmark my current
setup's performance and had O(n) performance estimates. It's the
difference between guessing at bottlenecks and making more intelligent
decisions.
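To make that concrete, here is a minimal sketch of how I would use such an estimate, assuming run time really is roughly linear in n: fit a line to a few benchmark runs, then extrapolate to a larger page count. The function names are my own, and the sample numbers in the usage note are made up for illustration, not measurements.

```python
def fit_linear(samples):
    """Least-squares fit of t = a + b*n to a list of (n, t) pairs."""
    count = len(samples)
    mean_n = sum(n for n, _ in samples) / count
    mean_t = sum(t for _, t in samples) / count
    var_n = sum((n - mean_n) ** 2 for n, _ in samples)
    cov = sum((n - mean_n) * (t - mean_t) for n, t in samples)
    b = cov / var_n          # seconds per additional page
    a = mean_t - b * mean_n  # fixed overhead in seconds
    return a, b


def predict_seconds(fit, pages):
    """Estimated run time for a given page count."""
    a, b = fit
    return a + b * pages


def days_until_limit(fit, pages_now, pages_per_day, limit_seconds):
    """Days of growth left before one run exceeds limit_seconds.

    Returns None if the fitted slope is not positive.
    """
    a, b = fit
    if b <= 0:
        return None
    max_pages = (limit_seconds - a) / b
    return max(0.0, (max_pages - pages_now) / pages_per_day)
```

For example, with invented timings such as fit_linear([(1000, 110.0), (2000, 210.0), (4000, 410.0)]), I could ask predict_seconds() for the cost at 130,000 pages, or ask days_until_limit() how long growth of 1,000 pages per day can continue before a run no longer fits in a nightly window. If the real cost is worse than linear, the straight-line fit will underestimate, which is exactly why knowing the true order matters.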
I set up a table below which would probably be enough to answer just
about any performance / scaling question. I filled in the few values
that seem obvious to me, and would love to know the rest. Any takers?
Jeff
Performance estimates, htdig

n = total number of messages indexed

============================================================================
            Time      RAM       Disk      Disk      Bandwidth
                                (final)   (peak)
----------------------------------------------------------------------------
Initial indexing
----------------------------------------------------------------------------
  htdig     ???       ???       ???       ???       O(n)
  ------------------------------------------------------------------------
  htmerge   ???       ???       ???       ???       0
----------------------------------------------------------------------------
Second indexing (no changes to data)
----------------------------------------------------------------------------
  htdig     ???       ???       ???       ???       ???
  ------------------------------------------------------------------------
  htmerge   ???       ???       ???       ???       0
----------------------------------------------------------------------------
Third indexing (one piece of data has changed)
----------------------------------------------------------------------------
  htdig     ???       ???       ???       ???       ???
  ------------------------------------------------------------------------
  htmerge   ???       ???       ???       ???       0
----------------------------------------------------------------------------
Fourth indexing (all data has changed)
----------------------------------------------------------------------------
  htdig     ???       ???       ???       ???       ???
  ------------------------------------------------------------------------
  htmerge   ???       ???       ???       ???       0
----------------------------------------------------------------------------
Additional Notes:
  Data above assume no use of -i or -a.
  -i will make everything perform like initial indexing.
  -a will double final disk requirements, plus add O(n) time
     for copying .work files (probably with a low constant).