09.06.2022 15:16, Adriano dos Santos Fernandes wrote:
> With some frequency people ask me why UTF-8 is slower than single-byte
> charsets.
> The thing is, they have something using, for example, VARCHAR(30)
> CHARACTER SET WIN1252, convert it to VARCHAR(30) CHARACTER SET UTF8,
> test with the same data, and get slower queries.
> The database also grows in size, and the record size limit (measured
> in characters) decreases.
> But if they test VARCHAR(120) CHARACTER SET WIN1252 vs VARCHAR(30)
> CHARACTER SET UTF8, database size and query times are similar. But that
> is just a test, not the real-world scenario the user cares about.
> We have old problems; for example, the record size limit is tracked here:
> https://github.com/FirebirdSQL/firebird/issues/1130
> As commented there, I tried to just increase the constant and it seems
> to just work.
Yes, it should work. However, I'm not going to remove the limit until we
introduce denser compression. Also, we have a number of places where
records are stored unpacked in memory (rpb's, RecordBuffer, HashJoin,
etc), so longer records could increase server memory usage. This should
be improved somehow.
> Then we have the RLE record compression algorithm, which "compresses"
> bytes that are well known to be unused. We have even had patches to
> improve this bad algorithm.
Yep.
> I believe that is not the way to go.
So do I, although an improved RLE could be a good workaround until
something significantly better is invented.
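For illustration, the kind of byte-oriented RLE being discussed can be sketched like this (a hypothetical minimal codec, not Firebird's actual on-disk encoder): a signed control byte introduces either a literal run or a repeated byte, so run lengths are capped at 127.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Minimal RLE sketch. A signed control byte n means:
//   n > 0 : the next n bytes are literal data
//   n < 0 : the next byte is repeated -n times
// Illustrative only; not Firebird's actual implementation.
Bytes rleCompress(const Bytes& in)
{
    Bytes out;
    size_t i = 0;
    while (i < in.size())
    {
        // Measure how long the current byte repeats (capped at 127)
        size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 127)
            run++;

        if (run >= 3)  // a repeat run pays off only beyond its 2-byte overhead
        {
            out.push_back(static_cast<uint8_t>(-static_cast<int>(run)));
            out.push_back(in[i]);
            i += run;
        }
        else
        {
            // Collect a literal run until a worthwhile repeat begins
            const size_t start = i;
            while (i < in.size() && i - start < 127)
            {
                if (i + 2 < in.size() && in[i] == in[i + 1] && in[i] == in[i + 2])
                    break;
                i++;
            }
            out.push_back(static_cast<uint8_t>(i - start));
            out.insert(out.end(), in.begin() + start, in.begin() + i);
        }
    }
    return out;
}

Bytes rleDecompress(const Bytes& in)
{
    Bytes out;
    size_t i = 0;
    while (i < in.size())
    {
        const int8_t n = static_cast<int8_t>(in[i++]);
        if (n > 0)
        {
            out.insert(out.end(), in.begin() + i, in.begin() + i + n);
            i += n;
        }
        else
            out.insert(out.end(), static_cast<size_t>(-n), in[i++]);
    }
    return out;
}
```

Note how a 300-byte run of padding must be split into several 127-byte repeat tokens; the 1-byte run length is exactly the limitation discussed further down this thread.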
> Let's still call it "record compression"; I believe it should be more
> active. Instead of working based only on the record buffer and its
> length, it should have access to the record format.
> Then it can encode things in a more active way, trimming out unused
> bytes of CHAR/VARCHAR and encoding numbers and booleans better. We may
> use the protocol-buffers format as inspiration.
> And then we probably don't need any RLE compression, as most of the
> data (excluding the unused bytes) is not so repetitive.
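For reference, the protocol-buffers format mentioned above encodes integers as varints: 7 payload bits per byte with a continuation flag, so small values take a single byte instead of a fixed 4 or 8. A minimal sketch (illustrative only, not a proposal for the actual on-disk format):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Protocol-buffers-style varint: 7 payload bits per byte, the high bit
// flags continuation. Small values (the common case for identifiers,
// counters, booleans) occupy a single byte.
void varintEncode(uint64_t value, std::vector<uint8_t>& out)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

uint64_t varintDecode(const std::vector<uint8_t>& in, size_t& pos)
{
    uint64_t value = 0;
    int shift = 0;
    while (in[pos] & 0x80)
    {
        value |= static_cast<uint64_t>(in[pos++] & 0x7F) << shift;
        shift += 7;
    }
    value |= static_cast<uint64_t>(in[pos++]) << shift;
    return value;
}
```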
I tried something like that in the past, but I called it "packing" as
opposed to "compression". The idea was to (1) skip NULL fields as
they're already marked in a leading bit mask, (2) skip padding bytes
because they can be reconstructed using a record format, (3) copy only
meaningful bytes of VARCHAR strings (using its vary_length which is also
stored). The rest (numerics/dates/CHARs) was copied "as is" (without
compression). Of course, CHARs and the real part of VARCHARs could be
compressed one way or another, but I intentionally left it for another day.
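A rough sketch of this packing idea, using a hypothetical field descriptor rather than Firebird's real record format: NULL fields are skipped entirely (the leading null mask is enough to restore them), and only the meaningful prefix of a VARCHAR is copied together with its stored length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical record format descriptor: each field is either fixed-size
// or a VARCHAR stored as <2-byte length><data padded to maxLen>.
struct FieldDesc
{
    bool isVarchar;
    uint16_t maxLen;  // data bytes, excluding the VARCHAR length word
};

// "Packing" sketch: skip NULL fields (recoverable from the null mask),
// drop VARCHAR padding (recoverable from the stored vary_length), copy
// everything else as is. No real compression is applied.
Bytes packRecord(const std::vector<FieldDesc>& format, const Bytes& rec)
{
    const size_t maskLen = (format.size() + 7) / 8;
    Bytes out(rec.begin(), rec.begin() + maskLen);  // null mask kept as is

    size_t off = maskLen;
    for (size_t f = 0; f < format.size(); f++)
    {
        const size_t fieldLen = format[f].maxLen + (format[f].isVarchar ? 2 : 0);
        const bool isNull = rec[f / 8] & (1 << (f % 8));

        if (!isNull)
        {
            if (format[f].isVarchar)
            {
                uint16_t varyLength;
                std::memcpy(&varyLength, &rec[off], 2);
                out.insert(out.end(), rec.begin() + off,
                           rec.begin() + off + 2 + varyLength);
            }
            else
                out.insert(out.end(), rec.begin() + off,
                           rec.begin() + off + fieldLen);
        }
        off += fieldLen;  // the unpacked buffer always holds the full slot
    }
    return out;
}
```

For a record with an INT, a VARCHAR(30) holding "hello" and a NULL field, the packed image shrinks from the full slotted size down to mask + 4 + 2 + 5 bytes, with no compression step at all.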
The problem, however, is that format-aware processing was found to be
slower. The dumb scheme presented above (with no real compression)
provided almost the same record size as RLE compression for mixed
"real-world" fields and was even denser for records with longish UTF8
fields, but it was also ~20% slower. Processing/copying every field
separately is slower than processing the record as a whole. This can be
seen even with the current RLE: the more "runs"
(compressed/uncompressed) we have there, the slower the decompression.
I know that RedSoft tried to implement a mixed compression where RLE was
used together with format-aware logic which decided what should become a
compressible run. I don't recall the performance figures though, maybe
Roman could share them.
> What do you think, and is there any active work in this regard?
Right now (for ODS 13.1) I'm working on an improvement to the current
RLE that does two things: (1) reduces the number of runs (avoiding short
compressible runs) and (2) allows longish compressible runs with a
2-byte (and possibly 3/4-byte) length. That should solve the problem
with UTF8 strings without any performance penalty.
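One way such longer runs could be encoded (a hypothetical illustration, not the actual ODS 13.1 format): reserve one control value as an escape introducing a 2-byte run length, so a long compressible run no longer has to be split into 127-byte pieces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical extended repeat run: control bytes -1..-127 keep their
// classic 1-byte-length meaning, while the otherwise unused value -128
// (0x80) escapes to a 2-byte little-endian run length. A decoder would
// treat 0x80 as "read 2 length bytes, then the repeated byte".
void putRepeatRun(std::vector<uint8_t>& out, uint8_t byte, size_t count)
{
    while (count > 0)
    {
        if (count > 127)
        {
            const size_t n = std::min(count, size_t(0xFFFF));
            out.push_back(static_cast<uint8_t>(-128));        // escape
            out.push_back(static_cast<uint8_t>(n & 0xFF));    // length lo
            out.push_back(static_cast<uint8_t>(n >> 8));      // length hi
            out.push_back(byte);
            count -= n;
        }
        else
        {
            out.push_back(static_cast<uint8_t>(-static_cast<int>(count)));
            out.push_back(byte);
            count = 0;
        }
    }
}
```

With a plain 1-byte length, 10000 identical padding bytes would need 79 repeat tokens (158 bytes); here they fit in a single 4-byte token, which is exactly the win for long UTF8 padding runs.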
For the next ODS, I was going to continue research regarding "packing"
and mixed "packing/compression" approaches to address the performance
issues. Nothing is done yet.
I'd also like to consider a completely different record storage format:
<null mask><fixed prefix><vary suffix>
where all VARCHARs are stored as <length> in the prefix part and their
contents are stored in the suffix part. This makes records
variable-length, and presumably a lot of code must be changed for that.
However, it reduces memory usage for records (only real length is
stored) and it allows flexible encoding: "as is" copying or some clever
packing for the fixed prefix and e.g. LZ4 compression for the variable
suffix.
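A sketch of how such a record might be assembled (a hypothetical helper, not an actual ODS proposal): VARCHAR lengths live in the fixed prefix, the string bytes are packed back to back in the suffix, and no padding is stored at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// One VARCHAR value: a pointer to its bytes and its actual length.
struct VaryField
{
    const char* data;
    uint16_t length;
};

// Build a record in the <null mask><fixed prefix><vary suffix> layout:
// the prefix holds the fixed fields plus one 2-byte length per VARCHAR,
// the suffix holds only the real string bytes.
Bytes buildRecord(const Bytes& nullMask, const Bytes& fixedFields,
                  const std::vector<VaryField>& varys)
{
    Bytes rec;
    rec.insert(rec.end(), nullMask.begin(), nullMask.end());
    rec.insert(rec.end(), fixedFields.begin(), fixedFields.end());

    // Prefix part: one 2-byte (little-endian) length per VARCHAR
    for (const auto& v : varys)
    {
        rec.push_back(static_cast<uint8_t>(v.length & 0xFF));
        rec.push_back(static_cast<uint8_t>(v.length >> 8));
    }

    // Suffix part: the actual string bytes, no padding
    for (const auto& v : varys)
        rec.insert(rec.end(), v.data, v.data + v.length);

    return rec;
}
```

With this layout a declared VARCHAR(120) CHARACTER SET UTF8 holding "hi" costs 4 bytes in memory (2 for the length, 2 for the data) instead of a fully padded slot, and the suffix is a single contiguous region that e.g. LZ4 could compress as a whole.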
Dmitry
Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel