Hi!

People frequently ask me why UTF-8 is slower than single-byte charsets.

The thing is, they have something using, for example, VARCHAR(30)
CHARACTER SET WIN1252, convert it to VARCHAR(30) CHARACTER SET UTF8,
test with the same data, and get slower queries.

The database also grows in size, and the record size limit (measured in
characters) decreases.

But if they test VARCHAR(120) CHARACTER SET WIN1252 against VARCHAR(30)
CHARACTER SET UTF8, database size and query times are similar. That is
just a test, though; it's not the real-world scenario the user wants.
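
That is because the record buffer reserves the declared character count
times the charset's maximum bytes per character (4 for UTF8, 1 for
WIN1252). A minimal sketch of that arithmetic, ignoring the VARCHAR
length word:

#include <cstdio>

// Bytes reserved in the record buffer for a VARCHAR(n) field's data:
// declared character count times the charset's maximum bytes per character.
static unsigned varcharBufferBytes(unsigned declaredChars, unsigned maxBytesPerChar)
{
    return declaredChars * maxBytesPerChar;
}

int main()
{
    std::printf("VARCHAR(30)  WIN1252: %u bytes\n", varcharBufferBytes(30, 1));   // 30
    std::printf("VARCHAR(30)  UTF8:    %u bytes\n", varcharBufferBytes(30, 4));   // 120
    std::printf("VARCHAR(120) WIN1252: %u bytes\n", varcharBufferBytes(120, 1));  // 120
    return 0;
}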

We have old problems here; for example, the record size limit is tracked at:
https://github.com/FirebirdSQL/firebird/issues/1130

As commented there, I tried to just increase the constant and it seems
to just work.

Then we have the RLE record compression algorithm, which "compresses"
bytes that are well known to be unused. We have even had patches to
improve that poor algorithm.
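
To make the discussion concrete, here is a sketch of that kind of RLE (a
generic version, not the engine's exact squeeze format): a control byte,
read as signed, either announces a run of one repeated byte or a chunk
of literal bytes, so it only pays off when the buffer has long runs,
which in practice are the unused bytes.

#include <cstddef>
#include <cstdint>
#include <vector>

// Generic run-length encoder (illustrative, not the engine's exact format).
// Control byte semantics (as int8_t): negative n means the next byte is
// repeated -n times, positive n means n literal bytes follow.
std::vector<uint8_t> rleCompress(const uint8_t* data, size_t length)
{
    std::vector<uint8_t> out;
    size_t pos = 0;

    while (pos < length)
    {
        // Measure the run of identical bytes starting at pos (capped at 127).
        size_t run = 1;
        while (pos + run < length && data[pos + run] == data[pos] && run < 127)
            ++run;

        if (run >= 3)
        {
            // Long enough to pay off: negative control byte, then the repeated byte.
            out.push_back(static_cast<uint8_t>(-static_cast<int8_t>(run)));
            out.push_back(data[pos]);
            pos += run;
        }
        else
        {
            // Copy literal bytes until a 3-byte run starts or the 127-byte chunk limit.
            size_t start = pos;
            while (pos < length && pos - start < 127)
            {
                if (pos + 2 < length && data[pos] == data[pos + 1] && data[pos] == data[pos + 2])
                    break;
                ++pos;
            }
            out.push_back(static_cast<uint8_t>(pos - start));
            out.insert(out.end(), data + start, data + pos);
        }
    }

    return out;
}

A VARCHAR(30) UTF8 field containing "abc" still hands 120 data bytes to
a routine like this; the compressor only shrinks the unused tail, and we
pay that CPU every time the record is stored or read.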

I believe that is not the way to go.

We may still call it "record compression", but I believe it should be
more active. Instead of working based only on the record buffer and its
length, it should have access to the record format.

Then it can encode things in a more active way: trimming out unused
bytes of CHAR/VARCHAR and encoding numbers and booleans better. We may
use the protocol-buffers format as inspiration.
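
As a hypothetical sketch of what I mean (names like encodeFields and
VarcharValue are just for illustration, not engine APIs): numbers go out
as protobuf-style varints, with zigzag for signed values, and a VARCHAR
goes out as its actual length plus only its used bytes.

#include <cstdint>
#include <vector>

// Protocol-buffers style varint: 7 bits per byte, high bit set on continuation.
static void putVarint(std::vector<uint8_t>& out, uint64_t value)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// ZigZag maps signed to unsigned so small negative numbers stay small too.
static uint64_t zigZag(int64_t value)
{
    return (static_cast<uint64_t>(value) << 1) ^ static_cast<uint64_t>(value >> 63);
}

struct VarcharValue
{
    uint16_t length;    // actual length in bytes
    const char* data;   // buffer sized for the declared maximum
};

// Encode one integer and one VARCHAR field: the integer shrinks to the bytes
// it really needs, and the VARCHAR stores only its used bytes, so the unused
// tail of the declared buffer never reaches the page.
static std::vector<uint8_t> encodeFields(int64_t intField, const VarcharValue& textField)
{
    std::vector<uint8_t> out;

    putVarint(out, zigZag(intField));      // number: varint-encoded
    putVarint(out, textField.length);      // string: actual length prefix...
    out.insert(out.end(), textField.data, textField.data + textField.length);  // ...plus used bytes only

    return out;
}

int main()
{
    char buffer[120] = "abc";              // VARCHAR(30) UTF8 data buffer, mostly unused
    VarcharValue text = { 3, buffer };

    std::vector<uint8_t> encoded = encodeFields(5, text);
    // 5 -> zigzag 10 -> one varint byte; "abc" -> one length byte + 3 data bytes,
    // i.e. 5 bytes total vs. 8 (BIGINT) + 2 + 120 bytes in the raw record image.
    return encoded.size() == 5 ? 0 : 1;
}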

And then we probably don't need any RLE compression at all, as most of
the data (excluding the unused bytes) is not that repetitive.

What do you think, and is there any active work in this regard?


Adriano


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
