RLR> The purpose of compression is of course NOT to make the store smaller.
RLR> It is to decrease the amount of I/O that is required to access the
RLR> store.

time to do I/O does not linearly depend on I/O amount.

for example, HDD reads data in blocks, with each block definitely not less than 512 bytes (most likely larger due to HDD doing read-aheads). if you compress your data into 50 bytes, 10 times, it won't make I/O anyhow faster.

moreover, random reads for HDD are orders (!) of magnitude slower than linear reads -- time to reposition device head is larger than time reading consecutive blocks.

i can show this in numbers, just checked on iometer, doing random reads:
512 byte blocks: 106 operations per second (~10 ms seek time), 0.05 MiB/sec
64 KiB   blocks:  95 operations per second (~10 ms seek time), 5.95 MiB/sec

you see, we have 120 times (!) more I/O, but very low difference in number of operations per second. so compression only will matter to people who store at least hundred KiB as their values.

or it will matter if people are doing lots of linear reads -- but it's hard for me to imagine such database access patterns.

RLR> there would be some databases for which it might be CPU-bound.  It is
RLR> hard for me to imagine this is true of most databases.

then it's worth defining what is "most databases" in case of elephant --
optimization without having a concrete target absolutely has no sense.

IMHO most people will be working with databases smaller than few gigabytes -- so read I/O has no sense for them (DDR reads are on scale of gigabytes per second),
write I/O might matter, though.

RLR> If one imagines an integer-valued slot with a Poisson distribution of
RLR> values, for example having an average value of less than a thousand
RLR> but still occasionally containing a much large number, then one can
RLR> expect a simple codec that employs this fact to use about 10 bits for
RLR> each value, and 30 bits for the outliers.  The result is again a
RLR> 3-fold reduction in I/O.

bits, huh? what storage nowadays works on bit level?
even cache lines are 32 bytes wide, so make no difference to read 1 byte or 32 of them.

RLR> Object pointers may of course cross the entire space in theory, but
RLR> OIDs that we store in the database may indeed come from a set of quite
RLR> knowable size, allowing us to use less than 64 bits or 32 bits to
RLR> represent them.  If we know the type of a slot to be only OIDs of a
RLR> particular class, we may very well know that that class has only 2**20
RLR> members, and therefore demands only 20 bits, as opposed to 64, to
RLR> represent the entire space of possible OIDs.

if you expect database being large enough for I/O to matter, why do you think then that there would be that few objects?

_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel

Reply via email to