All,

This is how I see things.
RLE is good for record compression due to the following facts:

1) An in-memory record has all fields expanded to their maximum size. Shorter values are padded with zeroes (which are easy to compress).
2) An in-memory record stores the NULL bitmap and at the same time zaps the space of NULL values with zeroes (easy to compress).
3) An in-memory record has all fields aligned to their respective boundaries. The unused gaps are also zapped with zeroes (easy to compress).

Of course, it also works well for small numbers stored in INT/BIGINT fields and for repeating character sequences in CHAR/VARCHAR fields. But I'd say that 95% of the RLE effect comes from the three cases above, especially (1) and (2).

Longish VARCHAR fields in UTF8 are an extreme example. The legacy algorithm compresses at most 128 bytes into 2 bytes, i.e. has a maximum compression ratio of 64x. It means that 10KB compresses into about 156 bytes, while theoretically it could be compressed into three or four bytes. Quite a huge difference. This is what the suggested new algorithm (actually, a clever modification of the old one) is expected to solve.

So far so good. My question is whether RLE compression would still be useful if points (1)-(3) disappeared. For example, if the record were packed the following way:

- NULL bitmap
- alignment padding is skipped
- NULL fields are skipped
- VARCHARs are stored in their actual length
- probably the same for CHARs, just spending some CPU cycles to calculate the actual length (*)

In other words, records are packed/unpacked semantically, using their formats. How would such an approach compare to the old RLE algorithm? To the proposed one? Anyone willing to test?

This way, RLE could still be used for long VARCHARs, optionally. Or it could be replaced with a value-based encoding or something else, again optionally.
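To make the 64x arithmetic concrete, here is a toy run-length encoder (not Firebird's actual code) with the same properties as described: each run of identical bytes becomes a two-byte (count, value) pair, with the run length capped at 128. Literal-run handling, which a real encoder needs for incompressible data, is deliberately omitted.

```python
def rle_compress(data: bytes, max_run: int = 128) -> bytes:
    """Toy RLE: every run of identical bytes -> (count, value) pair,
    count capped at max_run. Best case is therefore max_run/2 = 64x.
    A real encoder also needs an escape for literal (non-repeating) runs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < max_run:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decompress(blob: bytes) -> bytes:
    """Inverse of rle_compress: expand each (count, value) pair."""
    out = bytearray()
    for i in range(0, len(blob), 2):
        out += bytes([blob[i + 1]]) * blob[i]
    return bytes(out)
```

With this scheme, 10,000 zero bytes need 79 runs of up to 128 bytes each, i.e. 158 bytes of output, which illustrates the ~156-byte figure above: bounded by the 64x ratio no matter how compressible the input really is.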
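A minimal sketch of the "semantic" packing idea, under assumed simplifications (only INT and VARCHAR fields, a 2-byte actual-length prefix for VARCHARs; the format descriptor shape is hypothetical, not Firebird's):

```python
import struct

def pack_record(fmt, values):
    """Hypothetical semantic packer. fmt is a list of ('int'|'varchar',
    declared_max_len) descriptors. Output: NULL bitmap first, then the
    non-NULL values back to back -- no alignment padding, no expansion
    of VARCHARs to their declared maximum."""
    null_bits = 0
    body = bytearray()
    for i, ((kind, _max_len), v) in enumerate(zip(fmt, values)):
        if v is None:
            null_bits |= 1 << i          # NULL field: bitmap bit only, no data
            continue
        if kind == 'int':
            body += struct.pack('<i', v)
        else:                            # varchar: 2-byte actual length + bytes
            data = v.encode('utf-8')
            body += struct.pack('<H', len(data)) + data
    nbytes = (len(fmt) + 7) // 8
    return null_bits.to_bytes(nbytes, 'little') + bytes(body)
```

For example, a record of (INT, VARCHAR(1000), VARCHAR(1000)) holding (42, "abc", NULL) packs into 10 bytes: 1 bitmap byte + 4 int bytes + 2 length bytes + 3 data bytes, with the declared 1000-byte maximums and the NULL field contributing nothing.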
I'd go even further and eliminate point (1) completely:

- a record consists of two parts: fixed (described by the format) and variable (following the fixed one)
- all fixed-size fields are stored as now
- a VARCHAR is stored as {length, offset}, where the offset points to its data in the tail

i.e. switch to variable-length records. It would cost us some extra reallocations at runtime, but not as many as one could think. It could dramatically reduce memory usage in some subsystems. And much-shorter-than-declared strings wouldn't need mandatory RLE compression.

This surely still needs more thinking, but you get the idea.

(*) I doubt any sane person uses CHARs for strings longer than a few dozen characters, so I wouldn't care much about optimizing their storage.

Dmitry

------------------------------------------------------------------------------
Firebird-Devel mailing list, web interface at https://lists.sourceforge.net/lists/listinfo/firebird-devel
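P.S. The two-part layout sketched above, reduced to its essentials (a guess at one possible encoding, with 4-byte ints and 4+4-byte {length, offset} descriptors; offsets relative to the record start):

```python
import struct

def build_record(ints, varchars):
    """Sketch of a variable-length record: a fixed part holding the
    fixed-width fields plus one {length, offset} descriptor per VARCHAR,
    followed by a tail holding the actual string bytes."""
    fixed_size = 4 * len(ints) + 8 * len(varchars)
    fixed = bytearray()
    tail = bytearray()
    for v in ints:
        fixed += struct.pack('<i', v)
    for s in varchars:
        data = s.encode('utf-8')
        offset = fixed_size + len(tail)          # where this string will live
        fixed += struct.pack('<II', len(data), offset)
        tail += data
    return bytes(fixed) + bytes(tail)

def read_varchar(record, n_ints, index):
    """Fetch VARCHAR #index directly via its descriptor,
    without scanning the preceding fields."""
    pos = 4 * n_ints + 8 * index
    length, offset = struct.unpack_from('<II', record, pos)
    return record[offset:offset + length].decode('utf-8')
```

The point being: a much-shorter-than-declared string occupies only its actual bytes plus a fixed 8-byte descriptor, so it needs no RLE at all, while field access through the descriptors stays O(1).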