:> Hi all...
:>
:> One of the tasks that I have undertaken lately is to improve the efficiency
:> of a couple of storage facilities we use internally, here. Basically,
:> they are moderate-size tables currently implemented in SQL, which is OK in
:> terms of performance, but the hash function is breaking down and the
:> storage is rather inefficient for our table of about 2,850,000 members
:> (~2.1 GB total storage). There are 64M possible hash values in our
:> current implementation, and our record size is variable, but could be
:> safely fixed at about 1.5KB... So, total storage if all values were used
:> would be about 96GB. (See where I'm going with this?)
:>
:> One of the options I am considering is actually using address calculation,
:> and taking advantage of filesystem holes, to keep storage down to what is
:> actually being used, while providing instant lookups.
:>
:> The single file would be about 96G addressable bytes... But the actual
:> block count would be much lower. I suppose I will have to create a series
:> of these files and divide the problem into < 4GB chunks, but one
:> lookup/store will still be done with one calculation and one disk seek
:> (just more filehandles open).
:>
:> Deletes seem problematic. My question is, does the operating system
:> provide any way to free blocks used in the middle of a file?
:>
:> Must I periodically re-create these files (long and slow process, but not
:> so bad if done infrequently) to reclaim space, or is there a way to free
:> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem?
:> :-)
:>
:> - Ryan
:>
:> --
:> Ryan Thompson <[EMAIL PROTECTED]>
:> Network Administrator, Accounts
:> Phone: +1 (306) 664-1161
:>
:> SaskNow Technologies http://www.sasknow.com
:> #106-380 3120 8th St E Saskatoon, SK S7H 0W2
I would strongly recommend using a B+Tree instead of a hash table. With
a hash table, lookups for even slightly different keys will seek all over
the place; with a B+Tree, lookups for nearby keys stay localized.
For example, if you insert a (nearly) sorted dictionary of words into an
SQL table backed by a B+Tree, the memory working set required to insert
efficiently stays constant whether you are inserting a thousand, a million,
or a billion records. That is, the memory requirement is effectively
O(log N) for a disk requirement of O(N). With a hash table, the memory
working set required to insert efficiently is approximately O(N) for a disk
requirement of O(N)... much, much worse.
A B+Tree will also scale with the size of the dataset being managed,
so you do not have to preallocate or prereserve file space.
We are using an in-house SQL database for our product (which I can't
really talk about right now) and, using B+Trees, I can consistently
insert 300 records/sec on a cheap desktop PC with 64MB of RAM (which I
use for testing), even when the test database grows to several
gigabytes. (Remember, an insert requires an internal select to check
for key conflicts.)
In any case, a B+Tree is the way to go. Hash tables are useful only in
very, very, very specialized circumstances. In all other circumstances
they are nothing but a pain in the rear.
-Matt
To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message