> The file is 1.2 GB. As you can see, it contains 60 million
> entries and the space lost is 1%. I did that to test what we can expect
> if entries are compressed (as suggested by Keith, using Huffman coding
> with static frequencies).
The question I would suggest asking is this: pick a Huffman
encoding scheme, and then see what percentage of compression
you would expect on the data you're storing. That
is completely orthogonal to the fact that those key/data items
are stored in a Btree. Huffman encoding will not affect the
Btree storage algorithms at all, and you'll see the same
page-fill factors both before and after encoding (see below).
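As a rough illustration (a sketch of my own, not ht://Dig code), you
can estimate the achievable compression before building anything by
measuring the byte-level entropy of a sample of the keys and data.
The entropy in bits/byte is a lower bound on, and in practice a good
approximation of, the average static-Huffman code length, so
(1 - entropy/8) approximates the compression percentage N%:

#include <math.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
        unsigned long freq[256] = { 0 }, total = 0;
        double entropy = 0.0, p;
        int c, i;
        FILE *fp;

        if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
                fprintf(stderr, "usage: entest sample-file\n");
                return (1);
        }
        while ((c = getc(fp)) != EOF) {
                ++freq[c];
                ++total;
        }
        (void)fclose(fp);

        /* H = -sum(p * log2(p)) over the bytes that actually occur. */
        for (i = 0; i < 256; ++i)
                if (freq[i] != 0) {
                        p = (double)freq[i] / total;
                        entropy -= p * log(p) / log(2.0);
                }

        printf("entropy: %.2f bits/byte, expected compression ~%.0f%%\n",
            entropy, 100.0 * (1.0 - entropy / 8.0));
        return (0);
}

Run it on a dump of your word keys to get a first estimate of N%.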
> I would say that this is encouraging. There is still a mystery
> though (at least for me :-). Why do I have 99% fill in this usage pattern
> and only 60% when using keys and data whose sizes vary between 6 bytes
> and 35 bytes, with an average size of 8 bytes?
The trick is to insert sorted data. If you split a full Btree
page, half of the original page is put on one page and half on
another, giving you a 50% page fill factor. If you insert
sorted data, the Berkeley DB implementation splits in a special
way, putting almost all the old data on one page and only a
single key on the new page. From the Berkeley DB documentation:
The B+tree access method is an implementation of a sorted, balanced
tree structure. Searches, insertions, and deletions in the tree all
take O(log base_b N) time, where base_b is the average number of keys
per page, and N is the total number of keys stored. Often, inserting
ordered data into B+tree implementations results in pages that are
half-full. Berkeley DB makes ordered (or inverse ordered) insertion
the best case, resulting in nearly perfect page space utilization.
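Here is a minimal sketch of that experiment, written against the
Berkeley DB C API (shown with the 4.x-style DB->open() signature,
which may differ from the release you're running). Because the keys
are generated in increasing order, the special split described above
applies and the leaf pages come out nearly full:

#include <stdio.h>
#include <string.h>
#include <db.h>

int
main(void)
{
        DB *dbp;
        DBT key, data;
        char keybuf[32], databuf[32];
        int i, ret;

        if ((ret = db_create(&dbp, NULL, 0)) != 0) {
                fprintf(stderr, "db_create: %s\n", db_strerror(ret));
                return (1);
        }
        if ((ret = dbp->open(dbp,
            NULL, "sorted.db", NULL, DB_BTREE, DB_CREATE, 0664)) != 0) {
                dbp->err(dbp, ret, "open");
                return (1);
        }

        for (i = 0; i < 1000000; ++i) {
                /* Zero-padded keys sort lexicographically == numerically. */
                (void)snprintf(keybuf, sizeof(keybuf), "%010d", i);
                (void)snprintf(databuf, sizeof(databuf), "data-%d", i);

                memset(&key, 0, sizeof(key));
                memset(&data, 0, sizeof(data));
                key.data = keybuf;
                key.size = (u_int32_t)strlen(keybuf);
                data.data = databuf;
                data.size = (u_int32_t)strlen(databuf);

                if ((ret = dbp->put(dbp, NULL, &key, &data, 0)) != 0) {
                        dbp->err(dbp, ret, "put");
                        break;
                }
        }

        (void)dbp->close(dbp, 0);
        /* "db_stat -d sorted.db" then shows the leaf-page fill factor. */
        return (0);
}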
> When we compress entries, the keys and the data will no longer
> be of fixed size. Will we lose 40% of the space then?
Compression doesn't matter; what matters is whether or not the
keys you insert are sorted. My expectation for a Huffman
compression scheme would be that all key comparisons would still
have to be done on the uncompressed keys (so that user-specified
comparison routines would continue to work).
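To make that concrete, here is a sketch of what a comparison routine
over compressed keys would have to do. Two assumptions: the
4.x-style three-argument set_bt_compare() callback, and a
hypothetical huffman_decode() helper that is not part of Berkeley DB
or ht://Dig:

#include <string.h>
#include <db.h>

/* Hypothetical: decode a compressed key into buf, return decoded length. */
extern size_t huffman_decode(const void *src, size_t srclen,
    char *buf, size_t buflen);

static int
compressed_key_compare(DB *dbp, const DBT *a, const DBT *b)
{
        char bufa[256], bufb[256];
        size_t lena, lenb, minlen;
        int cmp;

        (void)dbp;              /* unused */

        /* Decode both keys so the sort order is that of the plaintext. */
        lena = huffman_decode(a->data, a->size, bufa, sizeof(bufa));
        lenb = huffman_decode(b->data, b->size, bufb, sizeof(bufb));

        minlen = lena < lenb ? lena : lenb;
        if ((cmp = memcmp(bufa, bufb, minlen)) != 0)
                return (cmp);
        return (lena < lenb ? -1 : (lena > lenb ? 1 : 0));
}

/* Registered before DB->open():
 *      dbp->set_bt_compare(dbp, compressed_key_compare);
 */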
> In conclusion, assuming that one document contains 100 words on
> average, and assuming that we can come close to this usage pattern using
> static Huffman compression, we would be able to index more than 500,000
> documents in a 1.2 GB word file.
I don't think this is quite right.
If you get N% compression from Huffman encoding, that's the
level of compression you should see on your data, but the
page-fill factor won't change much -- smaller key/data pairs do
result in better page-fill factors in Btrees, but not by N%! So,
if you do Huffman encoding you get N%, plus a little bit more
from the side effect of generally storing smaller objects.
To the extent that you can store your data in sorted order, you
get M% of your data space back, i.e., if you were seeing 60%
page-fill factors before, and you can insert in pure sorted
order, M would be close to 40%. This is separate from the
Huffman encoding.
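To put numbers on how the two effects combine (a back-of-the-envelope
sketch; the percentages below are placeholders, not measurements):
with a raw key/data volume S, N% Huffman compression, and a page-fill
factor F, the file comes out at roughly S * (1 - N) / F:

#include <stdio.h>

int
main(void)
{
        double raw_mb = 1200.0; /* placeholder raw key/data volume, in MB */
        double huffman = 0.40;  /* placeholder: assume 40% Huffman compression */
        double fill = 0.99;     /* page-fill factor with sorted insertion */

        printf("estimated file size: %.0f MB\n",
            raw_mb * (1.0 - huffman) / fill);
        return (0);
}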
There are many other algorithms for keeping the page-fill factor
in Btrees higher, mostly involving page balancing when keys are
added to or deleted from the tree. The reason Berkeley DB does
not implement them is that they greatly increase the chance of
deadlock: the operation ends up locking some number of logically
unrelated leaf pages.
Regards,
--keith
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic
Sleepycat Software Inc. [EMAIL PROTECTED]
394 E. Riding Dr. +1-978-287-4781
Carlisle, MA 01741-1601 http://www.sleepycat.com