On Thu, 2008-05-15 at 10:39 +0300, Alex Mizrahi wrote:
> RLR> The purpose of compression is of course NOT to make the store
> RLR> smaller. It is to decrease the amount of I/O that is required to
> RLR> access the store.
>
> time to do I/O does not linearly depend on I/O amount.
>
> for example, HDD reads data in blocks, with each block definitely not
> less than 512 bytes (most likely larger due to HDD doing read-aheads).
> if you compress your data into 50 bytes, 10 times, it won't make I/O
> anyhow faster.

You are correct, if you mean the time to perform a single "point
query" --- that is, to look up a value by a single key. Of course,
compressing it won't hurt.
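To make the block-granularity point concrete, here is a toy sketch (my
own illustration, not Elephant code; the 512-byte block size is just the
figure from your example):

;; Whole device blocks touched by a single point query.  Compression
;; below one block buys nothing; above one block it starts to pay.
(defun blocks-to-read (record-bytes &optional (block-bytes 512))
  (ceiling record-bytes block-bytes))

;; (blocks-to-read 500)  => 1
;; (blocks-to-read 50)   => 1   ; 10x compression, same I/O for a point query
;; (blocks-to-read 5000) => 10  ; here 10x compression would save 9 reads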
> moreover, random reads for HDD are orders (!) of magnitude slower than
> linear reads -- time to reposition device head is larger than time
> reading consecutive blocks.
>
> i can show this in numbers, just checked on iometer, doing random reads:
> 512 byte blocks: 106 operations per second (~10 ms seek time), 0.05 MiB/sec
> 64 KiB blocks: 95 operations per second (~10 ms seek time), 5.95 MiB/sec
>
> you see, we have 120 times (!) more I/O, but very low difference in
> number of operations per second.
> so compression only will matter to people who store at least hundred
> KiB as their values.
>
> or it will matter if people are doing lots of linear reads -- but it's
> hard for me to imagine such database access patterns.

I think linear reads are quite common. They are, after all, why we have
implemented range queries. Moreover, reading all of the data into memory
is quite common if you are using an "Object Prevalence" model. In a
relational database, the query optimizer has a very accurate model of
the number of pages that will have to be read by any given query, and it
searches the space of all possible plans in order to find the best one.
In many cases the best plan is indeed a linear scan; for example, many
join operations are best performed by a linear scan of at least one of
the tables in question. The aggregate functions (sum, min, max, avg,
count) are another example. Another typical example is "access all page
views of my website from Thursday to Friday".
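Your iometer numbers, by the way, drop straight out of a one-line seek
model. A sketch (mine, not iometer's; the 10 ms seek and 60 MiB/s
sequential transfer rate are assumed figures in the right ballpark):

;; Each random read pays one full seek, then streams the block at the
;; drive's sequential transfer rate.
(defun random-read-ops/sec (block-bytes
                            &key (seek-sec 0.010)
                                 (transfer-bytes/sec (* 60 1024 1024)))
  (/ 1.0 (+ seek-sec (/ block-bytes transfer-bytes/sec))))

(defun random-read-mib/sec (block-bytes &rest keys)
  (/ (* (apply #'random-read-ops/sec block-bytes keys) block-bytes)
     (* 1024 1024)))

;; (random-read-ops/sec 512)     => ~99.9 ops/sec
;; (random-read-mib/sec 512)     => ~0.05 MiB/sec
;; (random-read-ops/sec 65536)   => ~90.6 ops/sec
;; (random-read-mib/sec 65536)   => ~5.7 MiB/sec
;; The seek dominates: 128x more data per operation costs only about 10%
;; of the operations per second, which is exactly the effect you measured.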
> RLR> there would be some databases for which it might be CPU-bound. It
> RLR> is hard for me to imagine this is true of most databases.
>
> then it's worth defining what is "most databases" in case of elephant --
> optimization without having a concrete target absolutely has no sense.

Yes, I agree. I'm familiar with what you might call "enterprise"
applications --- large database-backed websites, with maybe a few
million rows and a total size of only tens of gigabytes. For such
applications I think my assumptions hold, and these are the same
assumptions that I dealt with in my dissertation. In fact, my entire
professional career could be summarized as applying caching to more and
more objects within an enterprise to avoid I/O costs. Perhaps you can
post a description of your database usage pattern, to explain why it is
dominated by point queries.

> IMHO most people will be working with databases smaller than few
> gigabytes --
> so read I/O has no sense for them (DDR reads are on scale of gigabytes
> per second),
> write I/O might matter, though.

Yes, write I/O does matter. I am assuming this will be done in fairly
large pages --- 8K was at one time the standard; I'm not sure what is
favored now. However, even in a 4 gigabyte or smaller database the read
time really matters.

> RLR> If one imagines an integer-valued slot with a Poisson distribution
> RLR> of values, for example having an average value of less than a
> RLR> thousand but still occasionally containing a much larger number,
> RLR> then one can expect a simple codec that employs this fact to use
> RLR> about 10 bits for each value, and 30 bits for the outliers. The
> RLR> result is again a 3-fold reduction in I/O.
>
> bits, huh? what storage nowadays works on bit level?

You are quite correct that most usage of storage is done at the byte and
word level --- I am arguing that this is an opportunity for us. Of
course, one must understand that the page will be read and written as an
8K chunk of bytes, which will contain carefully packed bit sets that do
not respect byte boundaries. The result will be that more objects can
fit into an 8K page, which will tend to decrease I/O by that amount. The
byte is the smallest object that can be written to a file, but that is
irrelevant to how data is laid out within a block of bytes.

> even cache lines are 32 bytes wide, so make no difference to read 1
> byte or 32 of them.

Correct --- you read 8K at a time, and hope that because you are getting
300 objects in one page instead of 100, you don't have to read the next
two pages; and that because you have fewer pages to cache, the chance
that any page you need to look up will reside in your cache is much
higher.

> RLR> Object pointers may of course cross the entire space in theory,
> RLR> but OIDs that we store in the database may indeed come from a set
> RLR> of quite knowable size, allowing us to use less than 64 bits or 32
> RLR> bits to represent them. If we know the type of a slot to be only
> RLR> OIDs of a particular class, we may very well know that that class
> RLR> has only 2**20 members, and therefore demands only 20 bits, as
> RLR> opposed to 64, to represent the entire space of possible OIDs.
>
> if you expect database being large enough for I/O to matter, why do you
> think then that there would be that few objects?

A database can't be so small that I/O doesn't matter. I suppose a
database can be so small that performance doesn't matter --- but if
performance matters, then I/O matters. I think one of our differences is
that you are looking at Elephant as it is, while I am imagining Elephant
with a caching system, and perhaps a file system that has a cache; and I
am not neglecting the relatively long time it takes for a CPU to
retrieve a byte from memory that is not already in the CPU cache.
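To show the sort of bit-level packing I have in mind, here is a toy
codec (a sketch of the idea only, emphatically not Elephant's actual
serializer): 20-bit OIDs laid end to end in a byte vector, ignoring byte
boundaries.

;; Pack OIDs end to end, LSB first, across byte boundaries.  Three
;; 20-bit OIDs fit in 8 bytes instead of the 24 a 64-bit encoding needs.
(defun pack-oids (oids &optional (bits 20))
  (let* ((total (* bits (length oids)))
         (bytes (make-array (ceiling total 8)
                            :element-type '(unsigned-byte 8)
                            :initial-element 0))
         (pos 0))
    (dolist (oid oids bytes)
      (dotimes (i bits)
        (when (logbitp i oid)
          (setf (ldb (byte 1 (mod pos 8)) (aref bytes (floor pos 8))) 1))
        (incf pos)))))

(defun unpack-oids (bytes count &optional (bits 20))
  (loop with pos = 0
        repeat count
        collect (loop with oid = 0
                      for i below bits
                      do (when (logbitp (mod pos 8)
                                        (aref bytes (floor pos 8)))
                           (setf oid (dpb 1 (byte 1 i) oid)))
                         (incf pos)
                      finally (return oid))))

;; (pack-oids '(1 1000 1048575))                 => 8 bytes, not 24
;; (unpack-oids (pack-oids '(1 1000 1048575)) 3) => (1 1000 1048575)

The same layout extends naturally to the 10-bit/30-bit escape coding I
described above for skewed slot values.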