All,

This is how I see things.

RLE is good for record compression due to the following facts:

1) The in-memory record has all fields expanded up to their maximum 
size. Shorter values are padded with zeroes (which are easy to compress).

2) The in-memory record stores the NULL bitmap and at the same time zaps 
the space of the NULL values with zeroes (which are easy to compress).

3) The in-memory record has all fields aligned to their respective 
boundaries. Unused gaps are also zapped with zeroes (which are easy to 
compress).

Of course, it also works well for small numbers stored in INT/BIGINT 
fields and for repeating character sequences in CHAR/VARCHAR fields. But 
I'd say that 95% of the RLE effect comes from the aforementioned three 
cases, especially (1) and (2). Longish VARCHAR fields in UTF8 are an 
extreme example.
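
To make that concrete, here is a tiny back-of-the-envelope example 
(assuming the usual worst-case 4 bytes reserved per character for UTF8; 
the numbers are illustrative only, not taken from the actual record 
layout code):

#include <cstdio>

int main()
{
    // Hypothetical VARCHAR(1000) CHARACTER SET UTF8 field: the in-memory
    // record reserves the worst case (4 bytes per character), while a
    // stored value like "hello" occupies just a few bytes; the rest is
    // zero padding that RLE later eats
    const unsigned reserved = 1000 * 4;
    const unsigned used = 5;
    printf("reserved: %u bytes, used: %u, zero padding: %u\n",
           reserved, used, reserved - used);
    return 0;
}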

The legacy algorithm compresses up to 128 bytes into 2 bytes, i.e. it 
has a maximum compression ratio of 64x. This means that 10KB compresses 
into about 156 bytes, while theoretically it could be compressed into 
three or four bytes. Quite a huge difference. This is what the suggested 
new algorithm (actually, a clever modification of the old one) is 
expected to solve. So far so good.
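
For those who want to play with the numbers, below is a minimal sketch 
of a signed-control-byte RLE compressor of the kind described above (a 
negative control byte means "repeat the next byte that many times", a 
positive one means "copy that many literal bytes"). It is an 
illustration only, not the actual codebase, but it reproduces the 64x 
ceiling:

#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<int8_t> rleCompress(const std::vector<uint8_t>& in)
{
    std::vector<int8_t> out;
    size_t pos = 0;

    while (pos < in.size())
    {
        // Measure the run of identical bytes starting at pos (128 max)
        size_t run = 1;
        while (pos + run < in.size() && in[pos + run] == in[pos] && run < 128)
            ++run;

        if (run >= 3)
        {
            // Worth encoding as a run: 2 output bytes cover up to 128
            // input bytes, hence the 64x ceiling mentioned above
            out.push_back(static_cast<int8_t>(-static_cast<int>(run)));
            out.push_back(static_cast<int8_t>(in[pos]));
            pos += run;
        }
        else
        {
            // Collect literals until the next run of 3+ identical bytes
            const size_t start = pos;
            while (pos < in.size() && pos - start < 127)
            {
                if (pos + 2 < in.size() &&
                    in[pos] == in[pos + 1] && in[pos] == in[pos + 2])
                    break;
                ++pos;
            }
            out.push_back(static_cast<int8_t>(pos - start));
            for (size_t i = start; i < pos; ++i)
                out.push_back(static_cast<int8_t>(in[i]));
        }
    }
    return out;
}

int main()
{
    // 10KB of zeroes: 80 runs of 128 bytes, i.e. ~160 output bytes
    std::vector<uint8_t> zeros(10 * 1024, 0);
    printf("10KB of zeroes -> %zu bytes\n", rleCompress(zeros).size());
    return 0;
}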

My question is: would RLE compression still be useful if points (1) - 
(3) disappear? For example, if the record is packed the following way:

- NULL bitmap
- alignment padding is skipped
- NULL fields are skipped
- VARCHARs are stored in their actual length
- probably the same for CHARs, just need to spend some CPU cycles to 
calculate the actual length (*)

In other words, records are packed/unpacked semantically, using their 
formats.
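
To illustrate what I mean, here is a rough sketch of such format-driven 
packing (all the structures and names are made up for the example, they 
are not the real record format classes):

#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct PackedField
{
    bool isVarchar;                    // variable-size vs. fixed-size field
    std::optional<std::string> value;  // std::nullopt represents SQL NULL;
                                       // fixed-size fields are assumed to
                                       // already hold their exact byte image
};

std::vector<uint8_t> packRecord(const std::vector<PackedField>& fields)
{
    std::vector<uint8_t> buf;

    // 1) NULL bitmap, one bit per field
    buf.resize((fields.size() + 7) / 8, 0);
    for (size_t i = 0; i < fields.size(); ++i)
        if (!fields[i].value)
            buf[i / 8] |= 1 << (i % 8);

    // 2) Field data: NULL fields and alignment padding are skipped,
    //    VARCHARs keep only their actual length (2-byte length prefix)
    for (const auto& f : fields)
    {
        if (!f.value)
            continue;

        if (f.isVarchar)
        {
            const uint16_t len = static_cast<uint16_t>(f.value->size());
            buf.push_back(static_cast<uint8_t>(len & 0xFF));
            buf.push_back(static_cast<uint8_t>(len >> 8));
        }
        buf.insert(buf.end(), f.value->begin(), f.value->end());
    }
    return buf;
}

A record with a few NULLs and short VARCHARs packed this way already 
loses most of the zeroes that the RLE pass currently feeds on.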

How would such an approach compare to the old RLE algorithm? To the 
proposed one? Anyone willing to test?

This way, RLE could still be used for long VARCHARs, optionally. Or 
replaced with a value-based encoding or something else, again optionally.

I'd go even further and eliminate point (1) completely:

- record consists of two parts: fixed (described by the format) and 
variable (follows the fixed one)
- all fixed-size fields are stored as now
- VARCHAR is stored as {length, offset}
- offset points to its data stored in the tail

i.e. switch to variable-length records. It would cost us some extra 
reallocations at runtime, but not as many as one might think. It could 
dramatically reduce memory usage in some subsystems. And 
much-shorter-than-declared strings wouldn't need mandatory RLE compression.
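
Roughly something along these lines (again, the names and layout are 
hypothetical, just to show the {length, offset} idea):

#include <cstdint>
#include <string>
#include <vector>

struct VarcharRef
{
    uint16_t length;   // actual data length
    uint32_t offset;   // offset of the data within the tail
};

struct VarLengthRecord
{
    std::vector<uint8_t> fixedPart;  // fixed-size fields + VarcharRef slots
    std::vector<uint8_t> tail;       // VARCHAR data stored in actual length

    // Append a VARCHAR value: the data goes to the tail, only the small
    // {length, offset} reference lands in the fixed part
    // (struct padding inside VarcharRef is ignored for simplicity)
    void addVarchar(const std::string& value)
    {
        VarcharRef ref;
        ref.length = static_cast<uint16_t>(value.size());
        ref.offset = static_cast<uint32_t>(tail.size());

        const uint8_t* p = reinterpret_cast<const uint8_t*>(&ref);
        fixedPart.insert(fixedPart.end(), p, p + sizeof(ref));
        tail.insert(tail.end(), value.begin(), value.end());
    }
};

Growing a VARCHAR then reallocates only the tail, which is where the 
"extra reallocations at runtime" mentioned above would come from.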

This surely still needs more thinking, but you get the idea.

(*) I doubt any sane person uses CHARs for strings longer than a few 
dozen characters, so I wouldn't care much about optimizing their storage.


Dmitry
