[HACKERS] Reducing data type space usage

Gregory Stark Fri, 15 Sep 2006 10:51:52 -0700


Following up on the recent discussion on list about wasted space in data
representations I want to summarise what we found and make some proposals:


As I see it there are two cases:

Case 1) Data types that are variable length but often quite small. This includes
   things like NUMERIC which in common use will rarely be larger than 12-20
   bytes and often things like text.

   In cases like these we really only need 1 or sometimes 2 byte varlena
   header overhead, not 4 as we currently do. In fact we *never* need more
   than 2 bytes of varlena header on disk anyways with the standard
   configuration.

Case 2) Data types that are different sizes depending on the typmod but are 
always
   the same size that can be determined statically for a given typmod. In the
   case of a ASCII encoded database CHAR(n) fits this category and in any case
   we'll eventually have per-column encoding. NUMERC(a,b) could also be made
   to fit this as well.
   
   In cases like these we don't need *any* varlena header. If we could arrange
   for the functions to have enough information to know how large the data
   must be.

Solutions proposed:

Case 1) We've discussed the variable sized varlena headers and I think it's 
clear
   that that's the most realistic way to approach it.

   I don't think any other approaches were even suggested. Tom said he wanted
   a second varlena format for numeric that would have 2-byte alignment. But I
   think we could always just say that we always use the 2-byte varlena header
   on data types with 2-byte alignment and the 4-byte header on data types
   with 4-byte alignment needs. Or heap_form_tuple could be even cleverer
   about it but I'm not sure it's worth it.

   This limits the wasted space to 1-2% for most variable sized data that are
   50 bytes long or more. But for very small data such as the quite common
   cases where those are often only 1-4 bytes it still means a 25-100%
   performance drain.

Case 2) Solving this is quite difficult without introducing major performance
   problems or security holes. The one approach we have that's practical right
   now is introducing special data types such as the oft-mentioned "char" data
   type. "char" doesn't have quite the right semantics to use as a transparent
   substitute for CHAR but we could define a CHAR(1) with exactly the right
   semantics and substitute it transparently in parser/analyze.c (btw having
   two files named analyze.c is pretty annoying). We could do the same with
   NUMERIC(a,b) for sufficiently small values of a and b with something like
   D'Arcy's CASH data type (which uses an integer internally).

   The problem with defining lots of data types is that the number of casts
   and cross-data-type comparisons grows quadratically as the number of data
   types grows. In theory we would save space by defining a CHAR(n) for
   whatever size n the user needs but I can't really see anything other than
   CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC substitute like CASH
   (with full NUMERIC semantics though) and maybe a 2-byte and 8-byte
   substitute might be reasonable but anything else would be pointless.

I see these two solutions as complementary. The variable varlena headers take
care of the larger data and the special-purpose data types take care of the
extremely small data. And pretty important to cover both cases data that fits
in 1-4 bytes is quite common. You often see databases with dozens of CHAR(1)
flag columns or NUMERIC(10,2) currency columns.

With a CHAR(1) and CASH style numeric substitute we won't have 25-100%
performance lost on the things that would fit in 1-4 bytes. And with the
variable sized varlena header we'll limit to 25% at worst and 1-2% usually the
performance drain due to wasted space on larger data.

Doing better would require a complete solution to data types that can
understand how large they are based on their typmod. That would imply more
dramatic solutions like I mused about involving passing around structures that
contain the Datum as well as the attlen or atttypmod. The more I think about
these ideas the more I think they may have merit but they would be awfully
invasive and require more thought.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org

[HACKERS] Reducing data type space usage

Reply via email to