On Thursday, June 26, 2003, at 07:20 AM, Krzysztof Gorgolewski wrote:

At the beginning of June we noticed that our index is getting too big.
It was usually 300-400 MB. Now it has swelled to over 3 GB and we don't
know why. The size of all indexed files (html, pdf, ps, txt) is about
2.3 GB. The files we're indexing have not changed, and htdig doesn't hang
while indexing. It even takes 8-9 hours longer!!

Are you sure that there were no changes to any of the pages and no changes whatsoever to the directory structure? It is possible for a symbolic link or poorly formed hyperlink in a document to cause htdig to loop through a lot of bogus URLs, indexing some of the same documents over and over again. Simply adding a single link to a document also has the potential to pull in arbitrarily large portions of a site that were not previously indexed.
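
One rough way to spot the looping case is to look for URLs whose path repeats the same directory names several times, which is what a symbolic link pointing back up the tree or a bad relative link tends to produce. The following is only a sketch: the function names, the threshold of 3, and the example URLs are assumptions made for illustration, and it expects you to supply the list of URLs yourself.

```python
from collections import Counter
from urllib.parse import urlsplit


def repeated_segments(url, threshold=3):
    """Return the path segments that occur at least `threshold` times in url.

    The threshold is an arbitrary illustrative choice; a legitimate site may
    repeat a directory name once or twice without any loop being involved.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    counts = Counter(segments)
    return sorted(s for s, n in counts.items() if n >= threshold)


def suspect_urls(urls, threshold=3):
    """Keep only the URLs that contain a heavily repeated path segment."""
    return [u for u in urls if repeated_segments(u, threshold)]


if __name__ == "__main__":
    # Hypothetical URLs, not taken from any real dig.
    sample = [
        "http://www.example.com/docs/index.html",
        "http://www.example.com/docs/a/docs/a/docs/a/index.html",
    ]
    for url in suspect_urls(sample):
        print(url, repeated_segments(url))
```

If a handful of directory names show up again and again in the suspect list, the document or link that introduces them is a good place to start looking.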


Are you certain that the start_url and limit_urls_to attributes have not changed in any way? Changes to either could allow more sites/directories to be indexed.

Are you reindexing from scratch, or performing updates? If the latter, it is possible that some sort of database corruption could be causing problems.

If you are indexing from scratch and can't think of anything else that has changed, you probably need to log the output of the dig and analyze it in order to determine where the problem might lie. If you are not already doing so, try running with the -s option to see if the number of indexed pages seems reasonable. You can also add one or more -v options in order to increase the verbosity of the output.
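
As a starting point for analyzing a saved log, something like the sketch below can show which part of the site accounts for most of the URLs. It deliberately assumes nothing about the exact layout of the verbose output beyond URLs appearing somewhere in the lines; the regular expression, the function name, and the file name dig.log are all assumptions for illustration.

```python
import re
import sys
from collections import Counter
from urllib.parse import urlsplit

# Loose pattern for anything that looks like an http(s) URL in a log line.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def count_by_area(log_lines):
    """Count distinct URLs per host and top-level directory."""
    seen = set()
    areas = Counter()
    for line in log_lines:
        for url in URL_RE.findall(line):
            if url in seen:
                continue  # count each distinct URL only once
            seen.add(url)
            parts = urlsplit(url)
            segments = [s for s in parts.path.split("/") if s]
            # Files directly under the document root are grouped as "host/".
            top = segments[0] if len(segments) > 1 else ""
            areas[f"{parts.netloc}/{top}"] += 1
    return areas


if __name__ == "__main__":
    # Usage (file name is hypothetical): python count_areas.py dig.log
    with open(sys.argv[1], errors="replace") as log:
        for area, n in count_by_area(log).most_common(20):
            print(f"{n:8d}  {area}")
```

A directory whose count is far larger than the number of files you know it contains, or a host you did not expect to see at all, points to where the extra index data is coming from.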

Jim


