In my experience, an index smaller than 80% of the size of the document
collection is pretty good.
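
As a quick check on the figures quoted below, here is a minimal Python sketch of the arithmetic. The sizes are copied from Bob's statistics; the variable and function names are made up for illustration and have nothing to do with htdig itself.

```python
# Sizes in MB, copied from the statistics quoted in the original message.
doc_collection_mb = 482
index_files_mb = {
    "db.docdb": 36,
    "db.words.db": 227,
    "db.docs.index": 5,
}


def index_ratio(index_sizes, collection_size):
    """Return the total index size as a fraction of the collection size."""
    return sum(index_sizes.values()) / collection_size


total_index_mb = sum(index_files_mb.values())
ratio = index_ratio(index_files_mb, doc_collection_mb)

print(f"index total: {total_index_mb} MB")      # index total: 268 MB
print(f"index / collection: {ratio:.1%}")       # index / collection: 55.6%
```

That works out to 268 MB of index for 482 MB of documents, or roughly 56%, comfortably under the 80% rule of thumb.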

I think it's a dreadful mistake to attempt to compress indexes at the
expense of searchability.  Hard drives are so incredibly cheap these
days that it's like cutting off your nose to make your face smaller.

Remember, for each of those 700 stop words, anyone who uses one in an
"and" search, or in a forthcoming phrase search, will get zero results --
that's frustrating.  Google has just announced they're no longer
rejecting stop words (they always indexed them, they just wouldn't
search on them by default).
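
To illustrate the point, here is a toy sketch in Python. It is not htdig's implementation: the tiny stop list stands in for a bad_word_list, the documents are invented, and it assumes the search side does not strip stop words from the query before matching. An engine that does filter the query would behave differently.

```python
# Toy inverted index; a sketch only, NOT how htdig works internally.
STOP_WORDS = {"the", "to", "of"}  # hypothetical stand-in for a bad_word_list


def build_index(docs, stop_words):
    """Map each non-stop word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in stop_words:
                continue  # stop words are never written to the index
            index.setdefault(word, set()).add(doc_id)
    return index


def and_search(index, query):
    """Return ids of documents containing every query term.

    Assumption: query terms are NOT filtered against the stop list.
    """
    terms = query.lower().split()
    if not terms:
        return set()
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result


docs = {
    1: "how to reduce the index size",
    2: "reduce disk usage",
}
index = build_index(docs, STOP_WORDS)

print(and_search(index, "reduce index"))      # {1}
print(and_search(index, "reduce the index"))  # set()
```

Under that assumption, the second query comes back empty even though document 1 contains all three words, because "the" was never indexed.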

And by truncating your max_head_length, you're reducing the
meta-information displayed in the search results, making them less
useful.

Perhaps there are better uses for your energy than saving a few more 
MB of disk space.

Avi

At 1:25 AM -0800 12/4/01, Bob Stayton wrote:
>Despite my efforts to reduce the size of my htdig index
>files, they seem kind of big, so I thought I would ask if
>they are out of line.  These are all HTML files on a single
>website.  Here are my statistics:
>
>Total documents:     59,546
>Total doc size:      482 MB
>db.docdb:             36 MB
>db.words.db:         227 MB
>db.docs.index:         5 MB
>
>The index files are 56% the size of the doc collection.
>Is this unusual?
>
>And this is after trying to reduce the
>index size by:
>
>- adding a 700 word bad_word_list.
>- setting max_head_length to only 50
>- adding 18 of the most common entries for common_url_parts
>
>I think it is marvelous that htdig handled this much
>content.  I'm just wondering if I'm missing something
>that might reduce the size of the index files.
>
>bobs
>Bob Stayton                                 400 Encinal Street
>Publications Architect                      Santa Cruz, CA  95060
>Technical Publications                      voice: (831) 427-7796
>Caldera International, Inc.                 fax:   (831) 429-1887
>                                             email: [EMAIL PROTECTED]
>
>_______________________________________________
>htdig-general mailing list <[EMAIL PROTECTED]>
>To unsubscribe, send a message to 
><[EMAIL PROTECTED]> with a subject of 
>unsubscribe
>FAQ: http://htdig.sourceforge.net/FAQ.html


-- 
Complete Guide to Search Engines for Web Sites and Intranets
    <http://www.searchtools.com>
