I feel compelled to respectfully disagree with Alex's views stated below. The purpose of compression is of course NOT to make the store smaller; it is to decrease the amount of I/O required to access the store, which is where I think most of the time is spent. I suspect Alex is correct that even if Elephant were coded perfectly (which it is not), there would be some databases for which Elephant is CPU-bound, but it is hard for me to imagine this is true of most databases.
I would like to make three points.

Firstly, irrespective of the ever-changing ratio of I/O costs within the storage hierarchy, smaller will always be faster. Solid-state drives and gigantic RAMs don't change this; they merely move the benefit by allowing one to store more data inside the cache. As technology improves, that cache can be thought of as RAM caching disk, or fast RAM caching slow RAM, or register sets caching fast RAM. The fact that locality of reference is good for performance remains an iron-clad law as long as you actually have a storage hierarchy with smaller-and-faster levels in front of bigger-and-slower ones. All of my professional and academic experience leads me to believe that database applications are I/O-bound, and that trading a constant amount of CPU time for a reduction in I/O load is a good idea.

Secondly, the compression we're talking about need not be "optimal" compression that seeks the maximum ratio; it only has to be efficiently computable while still saving space.

Finally, the examples you name -- small objects, integers, small strings, and object pointers -- are precisely the objects for which compact representations are efficient to compute. Simple character-frequency Huffman encoding gives about a factor of 3 compression on strings. If one imagines an integer-valued slot with a Poisson-like distribution of values -- say an average below a thousand, but occasionally a much larger number -- then a simple codec that exploits this can use about 10 bits for typical values and 30 bits for the outliers, again roughly a 3-fold reduction in I/O. Object pointers may in theory range over the entire OID space, but the OIDs we actually store often come from a set of quite knowable size, allowing us to use fewer than 64 or even 32 bits to represent them. If we know a slot holds only OIDs of a particular class, and that class has only 2**20 members, then 20 bits -- as opposed to 64 -- suffice to cover the entire space of possible OIDs. In a page-based system that brings in a group of objects per page, getting 3 times as many objects for the same I/O is a huge overall win. (Rough sketches of these codecs appear at the end of this message, before the quoted text.)

However, my point here was that I wanted to stay in "pure" LISP so that I/we would have the power to do this. (I understand now that Ian was not proposing otherwise.) Whether it will work is somewhat academic, because it should certainly be deployed as a "strategy" -- that is, the system either decides from statistics to employ compression for a given slot, or you declare a slot to be :poisson 1000 or :english-text or something, to suggest the most efficient codec. As you point out, a type-specific codec that compresses well and improves performance by exploiting specialized knowledge of a particular application-dependent datatype may be quite rare, but I think it is an important ability to have.

Since we should undoubtedly develop a plain B+Tree first, and since any implementation of compression should be strategy-based, this argument is somewhat academic. In theory, though, someone could write a compressing serializer right now, independently of implementing a pure-lisp Btree-based backend. For example, the serializer from Rucksack could be inserted in place of our existing serializer (or even offered as a configurable option) if we found it to be a better codec.
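To make the string example concrete, here is a small sketch -- not Elephant code, and the function name is my own -- that estimates how much character-frequency coding can save by computing the zeroth-order entropy of a sample string in bits per character. A Huffman code built from the same frequencies comes within one bit per symbol of this figure.

;; Sketch only: estimate the payoff of character-frequency coding.
;; Returns the zeroth-order entropy of STRING in bits per character;
;; a Huffman code over the same frequencies gets within 1 bit/char of it.
(defun bits-per-character (string)
  (let ((counts (make-hash-table :test #'eql))
        (n (length string)))
    (loop for ch across string
          do (incf (gethash ch counts 0)))
    (loop for count being the hash-values of counts
          for p = (/ count n)
          sum (* p (- (log p 2))))))

;; (bits-per-character <some English text>) is typically around 4-5,
;; versus 8 bits per character for raw octets.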
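Here is a similar sketch of the skewed-integer idea, assuming slot values fit in 32 bits; the function names are hypothetical, and a real serializer would write the resulting (code, width) pairs into a bit stream rather than return them as values.

;; Sketch only: tagged variable-width encoding for an integer slot whose
;; values are almost always below 1024.  Common values cost 11 bits
;; (1 tag bit + 10 payload bits); outliers cost 33 bits (1 tag + 32).
(defun encode-skewed-int (n)
  (assert (<= 0 n (1- (ash 1 32))))          ; sketch assumes 32-bit values
  (if (< n 1024)
      (values n 11)                          ; tag bit 0, 10-bit payload
      (values (logior (ash 1 32) n) 33)))    ; tag bit 1, 32-bit payload

(defun decode-skewed-int (code width)
  (ecase width
    (11 (ldb (byte 10 0) code))
    (33 (ldb (byte 32 0) code))))

With the common case at 11 bits instead of a fixed 32, this is roughly the 3-fold reduction described above.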
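Finally, a sketch of the typed-OID point, assuming a slot declared to hold only OIDs of a class whose instances carry dense class-local ids below 2**20; the names and packing layout are mine, not Elephant's. Three such ids fit in one 64-bit word, which is the three-times-as-many-objects-per-page effect in miniature.

;; Sketch only: pack three 20-bit class-local OIDs into one
;; (unsigned-byte 64) word instead of storing three full 64-bit OIDs.
(defun pack-local-oids (a b c)
  (assert (every (lambda (id) (< -1 id (ash 1 20))) (list a b c)))
  (logior a (ash b 20) (ash c 40)))

(defun unpack-local-oids (word)
  (values (ldb (byte 20  0) word)
          (ldb (byte 20 20) word)
          (ldb (byte 20 40) word)))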
On Wed, 2008-05-14 at 18:36 +0300, Alex Mizrahi wrote:

> IE> I was quite wrong-headed on the performance point. The biggest time
> IE> cost in database operations is time spent moving data to/from disk, in
> IE> comparison CPU time is quite small.
>
> this depends on size of a database -- relatively small ones can be cached in
> RAM, and with asynchronous writes disk won't be a bottleneck.
> also note that RAM gets larger and cheaper each year, so current "relatively
> small ones", on scale of several gigabytes, were considered big some years
> ago.
>
> also database are often optimized to minimize HDD seek times. but for now we
> have quite affordable Solid State Drives which do not have any seek time
> overhead at all and are pretty fast, so such optimization do not have any
> sense anymore.
>
> also elephant's workload is likely to be very different from relational
> database's one -- with RDMBS people often try to do as much work as possible
> with single optimized query, while Elephant users tend to do lots of small
> queries -- because it's, um, easy. also Elephant lacks optimization tricks
> that RDBMS can do, so it has to rely on crude processing power.
>
> so i believe that Elephant will be CPU-bound for most of uses/users, and any
> complications like compression will make it slower.
>
> people who will use Elephant for storage and rapid retrieval of some
> terabytes of text will find compression a cool feature, but i seriously
> doubt that there are many people who consider elephant a tool for such
> stuff.
>
> if you want to optimize something, better make sure that small objects, like
> numbers, object pointers, small strings etc. can be stored/retrieved with as
> little overhead as possible -- i think most values in database are such.

_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel