At 2:31 PM -0400 4/11/99, Jos Vos wrote:
>What are the practical and/or theoretical limits of ht://Dig?
>I'm trying to index a large site and after some time I always
>get something like "out of memory, 'new' failed" or so.

At 1:02 PM -0400 4/11/99, Tim Perdue, PHPBuilder.com wrote:
>My question is this: Will HTDIG hold up to 1.5 million pages? What about 2
>million?
>
>How in the world should I index all these emails and get decent performance
>when searching?

I'm going to try to answer both questions at the same time. The code itself
doesn't put any real limit on the number of pages. Right now, I know of
several sites in the hundreds of thousands of pages. I haven't heard bitter
complaints, so I assume they're fairly satisfied with performance. That's
not to say we're not working on improving memory requirements and
performance as much as possible. :-)

As for practical limits, they depend a lot on how many pages you plan on
indexing. Some OSes limit files to 2GB in size, which can become a problem
with a large database. Each of the programs also has slightly different
limits. Right now htmerge performs a sort on the words indexed, and most
sort programs use a fair amount of RAM and temporary disk space as they
assemble the sorted list. The htdig program stores quite a bit of
information about the URLs it visits, in part so that it indexes each page
only once. This takes a fair amount of RAM.
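
To make the two constraints above a bit more concrete, here is a rough
sketch in Python. It is not ht://Dig's actual code: the function names, the
link table, and the size estimate are all made up for illustration. It only
shows the general idea that remembering every visited URL costs memory that
grows with the site, and what the 2GB file limit looks like in bytes
(assuming the limit comes from a signed 32-bit file offset).

```python
import sys
from collections import deque

# Assumption: the "2GB" limit is the largest value a signed 32-bit
# file offset can hold.
TWO_GB_LIMIT = 2**31 - 1


def crawl_order(start_urls, links):
    """Visit each URL once, remembering every URL seen so far.

    `links` is a hypothetical dict mapping a URL to the URLs it links to.
    The `seen` set is what grows with the size of the site.
    """
    seen = set()
    order = []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # already indexed, skip it
        seen.add(url)
        order.append(url)
        queue.extend(links.get(url, []))
    return order, seen


def approx_seen_bytes(seen):
    """Very rough estimate of the memory held by the visited-URL set."""
    return sys.getsizeof(seen) + sum(sys.getsizeof(u) for u in seen)


def exceeds_2gb(size_bytes):
    """True if a file of this size would not fit under the 2GB limit."""
    return size_bytes > TWO_GB_LIMIT


if __name__ == "__main__":
    demo_links = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
    order, seen = crawl_order(["a"], demo_links)
    print(order)
    print(approx_seen_bytes(seen), "bytes (rough) for", len(seen), "URLs")
    print(exceeds_2gb(2**31))
```

The per-URL cost here is specific to Python and says nothing about
ht://Dig's real numbers; the point is only that it scales with the number
of pages, which is why large sites need more RAM.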

It's not a great answer, but with cheap RAM, it never hurts to throw more
memory at indexing larger sites. In a pinch, swap will work, but it
slows things down considerably.

-Geoff

P.S. If this sounds vague, I'm sorry. I see no reason it can't hold 1.5
million pages, but you'll probably want to invest in a fair chunk of RAM.
(ht://Dig isn't alone in this respect; most large database servers I've
seen have huge amounts of RAM.)

