09.06.2022 15:16, Adriano dos Santos Fernandes wrote:
> With some frequency people ask me why UTF-8 is slower than single-byte
> charsets.
> The thing is, they have something using, for example, VARCHAR(30)
> CHARACTER SET WIN1252, convert it to VARCHAR(30) CHARACTER SET UTF8,
> test with the same data, and get slower queries.
> The database also grows in size, and the record size limit (measured
> in characters) decreases.
> But if they test VARCHAR(120) CHARACTER SET WIN1252 vs VARCHAR(30)
> CHARACTER SET UTF8, database size and query times are similar. But that
> is just a test, not the real-world scenario the user cares about.
> We have old problems; for example, the record size limit is tracked here:
> https://github.com/FirebirdSQL/firebird/issues/1130
> As commented there, I tried to just increase the constant and it seems
> to just work.
Yes, it should work. However, I'm not going to remove the limit until we
introduce denser compression. Also, we have a number of places where
records are stored unpacked in memory (rpb's, RecordBuffer, HashJoin,
etc), so longer records could increase server memory usage. This should
be improved somehow.
> Then we have the RLE record compression algorithm, which "compresses"
> bytes that are well known to be unused. We have even had patches to
> improve this bad algorithm.
Yep.
> I believe that is not the way to go.
So do I, although an improved RLE could be a good workaround until
something significantly better is invented.
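For illustration, the kind of byte-oriented RLE being discussed can be sketched like this (a hypothetical minimal codec, not Firebird's actual on-disk encoder): a signed control byte introduces either a literal run or a repeated byte, so run lengths are capped at 127.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Minimal RLE sketch. A signed control byte n means:
//   n > 0 : the next n bytes are literal data
//   n < 0 : the next byte is repeated -n times
// Illustrative only; not Firebird's actual implementation.
Bytes rleCompress(const Bytes& in)
{
    Bytes out;
    size_t i = 0;
    while (i < in.size())
    {
        // Measure how long the current byte repeats (capped at 127)
        size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 127)
            run++;

        if (run >= 3)  // a repeat run pays off only beyond its 2-byte overhead
        {
            out.push_back(static_cast<uint8_t>(-static_cast<int>(run)));
            out.push_back(in[i]);
            i += run;
        }
        else
        {
            // Collect a literal run until a worthwhile repeat begins
            const size_t start = i;
            while (i < in.size() && i - start < 127)
            {
                if (i + 2 < in.size() && in[i] == in[i + 1] && in[i] == in[i + 2])
                    break;
                i++;
            }
            out.push_back(static_cast<uint8_t>(i - start));
            out.insert(out.end(), in.begin() + start, in.begin() + i);
        }
    }
    return out;
}

Bytes rleDecompress(const Bytes& in)
{
    Bytes out;
    size_t i = 0;
    while (i < in.size())
    {
        const int8_t n = static_cast<int8_t>(in[i++]);
        if (n > 0)
        {
            out.insert(out.end(), in.begin() + i, in.begin() + i + n);
            i += n;
        }
        else
            out.insert(out.end(), static_cast<size_t>(-n), in[i++]);
    }
    return out;
}
```

Note how a 300-byte run of padding must be split into several 127-byte repeat tokens; the 1-byte run length is exactly the limitation discussed further down this thread.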
> Let's still call it "record compression"; I believe it should be more
> active. Instead of working based only on the record buffer and its
> length, it should have access to the record format.
> Then it can encode things in a more active way, trimming out unused
> bytes of CHAR/VARCHAR and encoding numbers and booleans better. We may
> use the protocol-buffers format as inspiration.
> And then we probably don't need any RLE compression, as most of the
> data (excluding the unused bytes) is not so repetitive.
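For reference, the protocol-buffers format mentioned above encodes integers as varints: 7 payload bits per byte with a continuation flag, so small values take a single byte instead of a fixed 4 or 8. A minimal sketch (illustrative only, not a proposal for the actual on-disk format):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Protocol-buffers-style varint: 7 payload bits per byte, the high bit
// flags continuation. Small values (the common case for identifiers,
// counters, booleans) occupy a single byte.
void varintEncode(uint64_t value, std::vector<uint8_t>& out)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

uint64_t varintDecode(const std::vector<uint8_t>& in, size_t& pos)
{
    uint64_t value = 0;
    int shift = 0;
    while (in[pos] & 0x80)
    {
        value |= static_cast<uint64_t>(in[pos++] & 0x7F) << shift;
        shift += 7;
    }
    value |= static_cast<uint64_t>(in[pos++]) << shift;
    return value;
}
```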
I tried something like that in the past, but I called it "packing" as
opposed to "compression". The idea was to (1) skip NULL fields as
they're already marked in a leading bit mask, (2) skip padding bytes
because they can be reconstructed using a record format, (3) copy only
meaningful bytes of VARCHAR strings (using its vary_length which is also
stored). The rest (numerics/dates/CHARs) was copied "as is" (without
compression). Of course, CHARs and the real part of VARCHARs could be
compressed one way or another, but I intentionally left it for another day.
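A rough sketch of this packing idea, using a hypothetical field descriptor rather than Firebird's real record format: NULL fields are skipped entirely (the leading null mask is enough to restore them), and only the meaningful prefix of a VARCHAR is copied together with its stored length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical record format descriptor: each field is either fixed-size
// or a VARCHAR stored as <2-byte length><data padded to maxLen>.
struct FieldDesc
{
    bool isVarchar;
    uint16_t maxLen;  // data bytes, excluding the VARCHAR length word
};

// "Packing" sketch: skip NULL fields (recoverable from the null mask),
// drop VARCHAR padding (recoverable from the stored vary_length), copy
// everything else as is. No real compression is applied.
Bytes packRecord(const std::vector<FieldDesc>& format, const Bytes& rec)
{
    const size_t maskLen = (format.size() + 7) / 8;
    Bytes out(rec.begin(), rec.begin() + maskLen);  // null mask kept as is

    size_t off = maskLen;
    for (size_t f = 0; f < format.size(); f++)
    {
        const size_t fieldLen = format[f].maxLen + (format[f].isVarchar ? 2 : 0);
        const bool isNull = rec[f / 8] & (1 << (f % 8));

        if (!isNull)
        {
            if (format[f].isVarchar)
            {
                uint16_t varyLength;
                std::memcpy(&varyLength, &rec[off], 2);
                out.insert(out.end(), rec.begin() + off,
                           rec.begin() + off + 2 + varyLength);
            }
            else
                out.insert(out.end(), rec.begin() + off,
                           rec.begin() + off + fieldLen);
        }
        off += fieldLen;  // the unpacked buffer always holds the full slot
    }
    return out;
}
```

For a record with an INT, a VARCHAR(30) holding "hello" and a NULL field, the packed image shrinks from the full slotted size down to mask + 4 + 2 + 5 bytes, with no compression step at all.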
The problem, however, is that format-aware processing was found to be
slower. The dumb scheme presented above (with no real compression)
provided almost the same record size as RLE compression for mixed
"real-world" fields and was even denser for records with longish UTF8
fields, but it was also ~20% slower. Processing/copying every field
separately is slower than processing the record as a whole. This can be
seen even with the current RLE: the more "runs"
(compressed/uncompressed) we have there, the slower the decompression.
I know that RedSoft tried to implement a mixed compression where RLE was
used together with format-aware logic which decided what should become a
compressible run. I don't recall the performance figures though, maybe
Roman could share them.
> What do you think, and is there any active work in this regard?
Right now (for ODS 13.1) I'm working on an improvement to the current
RLE that does two things: (1) reduces the number of runs (avoiding short
compressible runs) and (2) allows longish compressible runs with a
2-byte (and possibly 3/4-byte) length. That should solve the problem
with UTF8 strings without any performance penalty.
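One way such longer runs could be encoded (a hypothetical illustration, not the actual ODS 13.1 format): reserve one control value as an escape introducing a 2-byte run length, so a long compressible run no longer has to be split into 127-byte pieces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical extended repeat run: control bytes -1..-127 keep their
// classic 1-byte-length meaning, while the otherwise unused value -128
// (0x80) escapes to a 2-byte little-endian run length. A decoder would
// treat 0x80 as "read 2 length bytes, then the repeated byte".
void putRepeatRun(std::vector<uint8_t>& out, uint8_t byte, size_t count)
{
    while (count > 0)
    {
        if (count > 127)
        {
            const size_t n = std::min(count, size_t(0xFFFF));
            out.push_back(static_cast<uint8_t>(-128));        // escape
            out.push_back(static_cast<uint8_t>(n & 0xFF));    // length lo
            out.push_back(static_cast<uint8_t>(n >> 8));      // length hi
            out.push_back(byte);
            count -= n;
        }
        else
        {
            out.push_back(static_cast<uint8_t>(-static_cast<int>(count)));
            out.push_back(byte);
            count = 0;
        }
    }
}
```

With a plain 1-byte length, 10000 identical padding bytes would need 79 repeat tokens (158 bytes); here they fit in a single 4-byte token, which is exactly the win for long UTF8 padding runs.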
For the next ODS, I was going to continue research regarding "packing"
and mixed "packing/compression" approaches to address the performance
issues. Nothing is done yet.
I'd also like to consider a completely different record storage format:
<null mask><fixed prefix><vary suffix>
where all VARCHARs are stored as <length> in the prefix part and their
contents are stored in the suffix part. This makes records
variable-length, and presumably a lot of code must be changed for that.
However, it reduces memory usage for records (only real length is
stored) and it allows flexible encoding: "as is" copying or some clever
packing for the fixed prefix and e.g. LZ4 compression for the variable
suffix.
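A sketch of how such a record might be assembled (a hypothetical helper, not an actual ODS proposal): VARCHAR lengths live in the fixed prefix, the string bytes are packed back to back in the suffix, and no padding is stored at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// One VARCHAR value: a pointer to its bytes and its actual length.
struct VaryField
{
    const char* data;
    uint16_t length;
};

// Build a record in the <null mask><fixed prefix><vary suffix> layout:
// the prefix holds the fixed fields plus one 2-byte length per VARCHAR,
// the suffix holds only the real string bytes.
Bytes buildRecord(const Bytes& nullMask, const Bytes& fixedFields,
                  const std::vector<VaryField>& varys)
{
    Bytes rec;
    rec.insert(rec.end(), nullMask.begin(), nullMask.end());
    rec.insert(rec.end(), fixedFields.begin(), fixedFields.end());

    // Prefix part: one 2-byte (little-endian) length per VARCHAR
    for (const auto& v : varys)
    {
        rec.push_back(static_cast<uint8_t>(v.length & 0xFF));
        rec.push_back(static_cast<uint8_t>(v.length >> 8));
    }

    // Suffix part: the actual string bytes, no padding
    for (const auto& v : varys)
        rec.insert(rec.end(), v.data, v.data + v.length);

    return rec;
}
```

With this layout a declared VARCHAR(120) CHARACTER SET UTF8 holding "hi" costs 4 bytes in memory (2 for the length, 2 for the data) instead of a fully padded slot, and the suffix is a single contiguous region that e.g. LZ4 could compress as a whole.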
Dmitry
Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel