09.06.2022 15:16, Adriano dos Santos Fernandes wrote:

> With some frequency, people ask me why UTF-8 is slower than
> single-byte charsets.

> The thing is, they have something using, for example, VARCHAR(30)
> CHARACTER SET WIN1252, convert it to VARCHAR(30) CHARACTER SET UTF8,
> test with the same data, and get slower queries.

> The database also increases in size, and the record size limit
> (measured in characters) decreases.

> But if they test VARCHAR(120) CHARACTER SET WIN1252 vs VARCHAR(30)
> CHARACTER SET UTF8, database size and query times are similar. But that
> is just a test; it's not the real-world scenario users want.

> We have old problems; for example, the record size limit is tracked here:
> https://github.com/FirebirdSQL/firebird/issues/1130

> As commented there, I tried simply increasing the constant, and it
> seems to just work.

Yes, it should work. However, I'm not going to remove the limit until we introduce denser compression. Also, we have a number of places where records are stored unpacked in memory (rpb's, RecordBuffer, HashJoin, etc.), so longer records could increase server memory usage. This should be improved somehow.

> Then we have the RLE record compression algorithm, which "compresses"
> bytes that are well known to be unused. We have even had patches to
> improve this bad algorithm.

Yep.

> I believe that is not the way to go.

So do I, although an improved RLE could be a good workaround until something significantly better is invented.
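For context, the kind of RLE under discussion can be sketched as a classic signed-count scheme (a simplified illustration, not Firebird's actual SQZ code): a positive control byte says how many literal bytes follow, and a "negative" control byte (stored as 256 - n) says the next byte repeats n times. The padding runs behind UTF8 fields compress well, but everything else passes through as literals plus control overhead.

```python
def rle_compress(data: bytes) -> bytes:
    # control > 0: that many literal bytes follow
    # control stored as 256 - n: the next byte repeats n times
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        j = i
        while j < n and data[j] == data[i] and j - i < 127:
            j += 1
        if j - i >= 3:                      # a run worth compressing
            out += bytes([256 - (j - i), data[i]])
            i = j
        else:                               # copy literals until the next run
            k = i
            while k < n and k - i < 127:
                if k + 2 < n and data[k] == data[k + 1] == data[k + 2]:
                    break
                k += 1
            out.append(k - i)
            out += data[i:k]
            i = k
    return bytes(out)

def rle_decompress(blob: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(blob):
        c = blob[i]; i += 1
        if c < 128:                         # literal run
            out += blob[i:i + c]; i += c
        else:                               # repeat run, count = 256 - c
            out += bytes([blob[i]]) * (256 - c); i += 1
    return bytes(out)
```

Note the 1-byte count caps a run at 127 bytes, so a 120-byte zero pad is fine but longer padding splits into multiple runs; that cap is one of the known weaknesses.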

> Let's still call it "record compression", but I believe it should be
> more active. Instead of working only on the record buffer and its
> length, it should have access to the record format.

> Then it can encode things in a more active way: trimming out unused
> bytes of CHAR/VARCHAR and encoding numbers and booleans more compactly.
> We may use the protocol-buffers format as inspiration.

> And then we probably don't need any RLE compression, as most of the
> data (as opposed to the unused bytes) is not so repetitive.

I tried something like that in the past, but I called it "packing" as opposed to "compression". The idea was to (1) skip NULL fields as they're already marked in a leading bit mask, (2) skip padding bytes because they can be reconstructed using a record format, (3) copy only meaningful bytes of VARCHAR strings (using its vary_length which is also stored). The rest (numerics/dates/CHARs) was copied "as is" (without compression). Of course, CHARs and the real part of VARCHARs could be compressed one way or another, but I intentionally left it for another day.
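A minimal sketch of that packing idea, using a hypothetical field-descriptor list in place of the real (C++) record format structures: NULLs cost only a mask bit, fixed fields are copied as is, and VARCHARs contribute only their stored vary_length plus the meaningful bytes.

```python
def pack_record(fmt, values):
    # fmt: hypothetical list of ("varchar", max_len) or ("fixed", byte_len)
    mask = bytearray((len(fmt) + 7) // 8)       # leading NULL bit mask
    body = bytearray()
    for i, ((kind, size), val) in enumerate(zip(fmt, values)):
        if val is None:
            mask[i // 8] |= 1 << (i % 8)        # (1) NULL: mask bit only
        elif kind == "varchar":
            data = val.encode("utf-8")
            body += len(data).to_bytes(2, "little")  # stored vary_length
            body += data                        # (3) only the meaningful bytes
        else:
            body += val                         # fixed field copied as is
    return bytes(mask) + bytes(body)            # (2) padding is never stored

def unpack_record(fmt, blob):
    pos = (len(fmt) + 7) // 8
    mask, values = blob[:pos], []
    for i, (kind, size) in enumerate(fmt):
        if mask[i // 8] & (1 << (i % 8)):
            values.append(None)
        elif kind == "varchar":
            ln = int.from_bytes(blob[pos:pos + 2], "little")
            values.append(blob[pos + 2:pos + 2 + ln].decode("utf-8"))
            pos += 2 + ln
        else:
            values.append(blob[pos:pos + size])
            pos += size
    return values
```

A record with two VARCHAR(120) UTF8 fields holding "hi" and NULL plus one 4-byte field packs into 9 bytes, versus hundreds unpacked; the cost is exactly the per-field loop that the next paragraph identifies as the performance problem.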

The problem, however, is that format-aware processing was found to be slower. The dumb scheme presented above (with no real compression) produced almost the same record size as RLE compression for mixed "real-world" fields, and was even denser for records with longish UTF8 fields, but it was also ~20% slower. Processing/copying every field separately is slower than processing the record as a whole. This can be demonstrated even for the current RLE: the more "runs" (compressed/uncompressed) we have there, the slower the decompression.

I know that RedSoft tried to implement a mixed compression where RLE was used together with format-aware logic which decided what should become a compressible run. I don't recall the performance figures though, maybe Roman could share them.

> What do you think, and is there any active work in this regard?

Right now (for ODS 13.1) I'm working on an improvement to the current RLE that does two things: (1) reduces the number of runs (avoiding short compressible runs) and (2) allows longish compressible runs with a 2-byte (and possibly 3/4-byte) length. That should solve the problem with UTF8 strings without any performance penalty.
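One possible shape for (2), purely as an assumption about the encoding rather than the actual on-disk design: reserve a zero control byte as an escape that introduces a 2-byte little-endian run length, so a long padding run costs 4 bytes total instead of a chain of 127-byte runs.

```python
def encode_run(count: int, value: int) -> bytes:
    # hypothetical encoding; 3 <= count <= 0xFFFF, value is the repeated byte
    if count <= 127:
        return bytes([256 - count, value])  # classic 1-byte signed count
    # escape: 0x00 control byte + 2-byte little-endian count + repeated byte
    return b"\x00" + count.to_bytes(2, "little") + bytes([value])

def decode_run(blob: bytes) -> bytes:
    if blob[0] == 0x00:
        return bytes([blob[3]]) * int.from_bytes(blob[1:3], "little")
    return bytes([blob[1]]) * (256 - blob[0])
```

Under this assumed scheme a 1000-byte pad is a single 4-byte run, and the decompressor handles fewer, longer runs, which is exactly what helps the per-run overhead described earlier.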

For the next ODS, I was going to continue research into "packing" and mixed "packing/compression" approaches to address the performance issues. Nothing is done yet.

I'd also like to consider a completely different record storage format:

<null mask><fixed prefix><vary suffix>

where all VARCHARs are stored as <length> in the prefix part and their contents are stored in the suffix part. This makes records variable-length, and presumably a lot of code must be changed for that. However, it reduces memory usage for records (only the real length is stored) and allows flexible encoding: "as is" copying or some clever packing for the fixed prefix, and e.g. LZ4 compression for the variable suffix.
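A rough sketch of that layout, again with hypothetical field descriptors standing in for the real format (nothing here is the actual design): for a given format the mask and prefix have a fixed size, while the suffix holds only the real VARCHAR contents as one contiguous blob.

```python
def build_record(fmt, values):
    # fmt: hypothetical list of ("varchar", max_len) or ("fixed", byte_len)
    mask = bytearray((len(fmt) + 7) // 8)
    prefix = bytearray()   # fixed-size for a given format: lengths + fixed fields
    suffix = bytearray()   # variable part: VARCHAR contents only, real lengths
    for i, ((kind, size), val) in enumerate(zip(fmt, values)):
        if val is None:
            mask[i // 8] |= 1 << (i % 8)
        if kind == "varchar":
            data = b"" if val is None else val.encode("utf-8")
            prefix += len(data).to_bytes(2, "little")  # <length> in the prefix
            suffix += data                             # contents in the suffix
        else:
            # placeholder bytes for NULL keep the prefix size format-determined
            prefix += bytes(size) if val is None else val
    # the suffix is one contiguous blob, so e.g. LZ4 could compress it
    # separately from the (possibly "as is") fixed prefix
    return bytes(mask) + bytes(prefix) + bytes(suffix)
```

Because the prefix size depends only on the format, field offsets stay computable without scanning, while the record as a whole shrinks to the data actually present.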


Dmitry


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel