Hi Jim,
I only have my own DBs, which are designed for short record lengths (on disk).
I am looking for some real examples, but they are not easy to get.
Some data from my DB (new RLE):
Primary pointer page: 384, Index root page: 385
Total formats: 1, used formats: 1
Average record length: 31.20, total records: 125819782
Average version length: 0.00, total versions: 0, max versions: 0
Average fragment length: 0.00, total fragments: 0, max fragments: 0
Average unpacked length: 8038.00, compression ratio: 257.65
Pointer pages: 119, data page slots: 385896
About the packed-size threshold:
dictionary-based compressions are inefficient on small data; see:
http://wiki.illumos.org/display/illumos/LZ4+Compression
There are some statistics here:
https://www.illumos.org/attachments/822/lz4_compression_bench.ods
http://fastcompression.blogspot.cz/2013/08/inter-block-compression.html
Also, some frame information is needed (about 15 bytes per packed block).
For short records (as in my example) it will not help.
For long records (a typical situation when a note field is filled in) it can
help significantly.
It can also be very effective on text blobs, mainly those stored in HTML format.
Finally, I'm not sure that 4 kB is a good threshold, but I believe that
some threshold will be needed.
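For illustration, a minimal sketch of the kind of threshold check I mean (assuming the current lz4 API with LZ4_compress_default; the 4 kB value and the helper name are only placeholders):

#include <lz4.h>      // https://code.google.com/p/lz4/

// Hypothetical helper, only to show the threshold idea.
// Packs a record with LZ4 when it is large enough to be worth it;
// returns the packed size, or 0 when the record should stay on the
// current RLE path.
static const int PACK_THRESHOLD = 4096;   // placeholder, value under discussion

int pack_record(const char* rec, int rec_len, char* out, int out_cap)
{
    if (rec_len < PACK_THRESHOLD)
        return 0;                                  // short record: keep RLE

    const int packed = LZ4_compress_default(rec, out, rec_len, out_cap);
    if (packed <= 0 || packed + 15 >= rec_len)     // ~15 bytes of frame overhead
        return 0;                                  // not worth packing

    return packed;
}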
Slavek
Ing. Slavomir Skopalik
Executive Head
Elekt Labs s.r.o.
Collection and evaluation of data from machines and laboratories
by means of system MASA (http://www.elektlabs.cz/m2demo)
-----------------------------------------------------------------
Address:
Elekt Labs s.r.o.
Chaloupky 158
783 72 Velky Tynec
Czech Republic
---------------------------------------------------------------
Mobile: +420 724 207 851
icq:199 118 333
skype:skopaliks
e-mail:skopa...@elektlabs.cz
http://www.elektlabs.cz
On 16.3.2015 17:42, James Starkey wrote:
I'd like to see some numbers computed from an actual (real) Firebird
database before it is considered.
But why records only over 4k? And what commonality do you expect to find
on large records?
On Monday, March 16, 2015, Slavomir Skopalik <skopa...@elektlabs.cz> wrote:
Hi Jim,
I did some research on storage compression and found this project:
https://code.google.com/p/lz4/
My idea is to use this only if the encoded size of a record is more than
approximately 4 kB.
Do you have any notes on why it could be a bad idea?
Thanks Slavek
PS: I made some changes in Firebird to rip the compressor out of the
storage engine (and put the new RLE in a second step and the encoding in a third step),
but it was rejected by the community :)
On 1.3.2015 18:55, Slavomir Skopalik wrote:
Hi Jim,
my proposal was not as abstract as yours.
I just want to put all parts of encoding/decoding into one class with a
clear interface that makes it possible
to plug in a different encoder at development time (FB3+).
I will contact the Firebird developers to agree on changes to
this class:
class Compressor : public Firebird::AutoStorage
If it is possible to have access to the record format, it would be easy to
create
a self-describing encoding.
I have an idea for such a scheme in mind that I would like to test.
Slavek
On 28.2.2015 22:43, Jim Starkey wrote:
OK, I think I understand what you are trying to do -- and please
correct me if I'm wrong. You want to standardize an interface between an
encoding and DPM, separating the actual encoding/decoding from the
fragmentation process. In other words, you want to compress a record in
toto, then let somebody else chop the resulting byte stream to and from
data pages. In essence, this makes the compression scheme plug-replaceable.
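If I read you right, something along these lines (a rough sketch with invented names, not the actual Firebird classes):

// Rough sketch, invented names; not the actual Firebird classes.
// The codec sees whole records; DPM sees only an opaque byte stream.

typedef unsigned char UCHAR;    // stand-ins for the Firebird typedefs
typedef unsigned long ULONG;

class RecordCodec
{
public:
    virtual ~RecordCodec() {}

    // Compress a whole record into 'out'; return the packed length.
    virtual ULONG encode(const UCHAR* record, ULONG length,
                         UCHAR* out, ULONG outCapacity) = 0;

    // Reassembled fragments in, whole record out.
    virtual ULONG decode(const UCHAR* packed, ULONG length,
                         UCHAR* record, ULONG recordCapacity) = 0;
};

// DPM side: chops the packed stream into page-sized fragments and
// glues them back together, knowing nothing about the encoding itself.
void store_packed(RecordCodec& codec, const UCHAR* record, ULONG length);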
If this is your intention, it isn't a bad idea, but it does have
problems. The first is how to map a given record to a particular decoding
schema. The second, more difficult, is how to do this without bumping the
ODS (desirable, but not essential). A third is how to handle encodings
that are not variations on run length encoding (such as value based
encoding).
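To make the first problem concrete: the obvious, purely hypothetical fix is a per-record encoding tag, which is exactly the kind of ODS change one would rather avoid:

// Hypothetical illustration of the mapping problem; not a proposal.
// A one-byte tag on each record would name its decoder, but adding the
// byte is itself an ODS change, and a value-based encoding still needs
// the record format, not just the byte stream.
enum RecordEncoding
{
    ENC_RLE   = 0,   // classic run-length scheme
    ENC_LZ4   = 1,   // LZ-style block compressor
    ENC_VALUE = 2    // value-based encoding
};

// Dispatch on the tag when reassembling a record (decoder bodies omitted):
//   switch (tag) { case ENC_RLE: ...  case ENC_LZ4: ...  case ENC_VALUE: ... }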
If I'm on the right track, do note that the current decoding schema
already fits your bill. Concatenate the fragments and decode. The
encoding process, on the other hand, is more problematic.
Encoding/decoding in place is more efficient than using a temp, but not
so much as to preclude it. I might be wrong, but I doubt that the existing
schema shows up as a hot spot in a profile. But that said, I'm far from
convinced that variations on a run length theme are going to have any
significant benefit for either density or performance.
My post-Interbase database systems don't access records on page (NuoDB
doesn't even have pages). Records have one format in storage and other
formats in memory, within a record class that understands the transitions
between formats (essentially doing the various encoding and decoding). There
is generally an encoded form (raw byte stream), a descriptor vector for
building new records, and some sort of ancillary structure for field
references to either.
In my mind, I think it would be wiser for Firebird to go with a flexible
record object than to simply abstract the encoding/decoding process. More
code would need to be changed, but when you were done, there would be much
less code.
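Roughly the shape I mean, as a sketch only (invented names; dsc is Firebird's field descriptor):

// Sketch of a "flexible record object"; invented names, not NuoDB or
// Firebird code.  The record owns the transitions between its formats.
// UCHAR/ULONG and dsc stand for the usual Firebird types.
class Record
{
public:
    // Build from the stored (encoded) byte stream.
    static Record* fromEncoded(const UCHAR* stream, ULONG length);

    // Build a new record from a descriptor vector (one dsc per field).
    static Record* fromDescriptors(const dsc* fields, int count);

    // Field access goes through the record; it decodes lazily or uses
    // the in-memory form, whichever already exists.
    void getField(int fieldId, dsc* into);

    // Produce the encoded form for storage; callers never see how
    // (or whether) it was compressed.
    ULONG encode(UCHAR* out, ULONG capacity) const;

private:
    const UCHAR* encodedForm;    // raw byte stream as stored
    dsc*         descriptors;    // in-memory form for building new records
};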
Architecturally, abstracting encoding/decoding makes sense, but
practically, I don't think it buys much. A deep reorganization, I believe, would
have a much better long-term payoff.
But then maybe I missed your point...
Jim Starkey
On Feb 28, 2015, at 10:30 AM, Slavomir Skopalik <skopa...@elektlabs.cz> wrote:
Hi Jim,
I don't want to change the ODS just to save one byte per page.
I want to change the sources to be able to implement a different
encoder (call it whatever you want) -> change the ODS.
For some encoders the fragmentation loss is 1-2 bytes, for others it
can be more.
For some encoders reverse parsing is easy, for others
it is much more complicated.
In some situations, generating a control stream can be a benefit,
but as it is now in the sources (FB2.5, FB3) that I have read, it is not.
Current compressor interface:
To create the control stream:
ULONG SQZ_length(const SCHAR* data, ULONG length, DataComprControl* dcc)
To create the final stream from the control stream:
void SQZ_fast(const DataComprControl* dcc, const SCHAR* input, SCHAR* output)
To calculate how many bytes can be compressed into a small area (from the control stream):
USHORT SQZ_compress_length(const DataComprControl* dcc, const SCHAR* input, int space)
To compress into a small area:
USHORT SQZ_compress(const DataComprControl* dcc, const SCHAR* input, SCHAR* output, int space)
And to decompress:
UCHAR* SQZ_decompress(const UCHAR* input, USHORT length, UCHAR* output, const UCHAR* const output_end)
And some routines are directly in the storage code.
In FB3 it is very similar (changed names, organized into a class, the same hack
in store_big_record; the problem is not the code itself, but where the code
is).
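Roughly, the storage code drives these functions like this (a simplified sketch from my reading of the sources, not the exact DPM code):

// Simplified sketch of the calling sequence; not the exact DPM code.
void store_record(const SCHAR* record, ULONG record_length,
                  SCHAR* page_space, int space_on_page)
{
    DataComprControl dcc;

    // Pass 1: scan the record, build the control stream, get the packed size.
    const ULONG packed_length = SQZ_length(record, record_length, &dcc);

    if (packed_length <= (ULONG) space_on_page)
    {
        // Whole record fits: pass 2 re-reads the record and emits it packed.
        SQZ_fast(&dcc, record, page_space);
    }
    else
    {
        // Must fragment: find how much input squeezes into the free space,
        // pack just that much, then continue on the next page.
        const USHORT taken = SQZ_compress_length(&dcc, record, space_on_page);
        SQZ_compress(&dcc, record, page_space, space_on_page);
        // ... advance 'record' by 'taken' bytes and repeat for the next fragment ...
    }
}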
The question is:
Why keep the control stream (worse CPU, slightly worse HDD, and, also important
to me, less readable code)?
It seems that it was implemented this way because of RAM
limitations.
And another question:
What functions and parameters should the new interface have?
If you have an idea how to use the control stream with benefits, please share
it.
Slavek
BTW: If we drop the control stream, the posted code reduces to one memcpy
that is implemented with SSE+ instructions.
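For illustration, a single-pass encoder needs no control stream at all. A minimal sketch, assuming the classic control-byte format (a positive count means literal bytes follow, a negative count means the next byte is repeated):

#include <string.h>

typedef unsigned char UCHAR;   // stand-ins for the Firebird typedefs
typedef signed char   SCHAR;
typedef unsigned long ULONG;

// Minimal single-pass RLE sketch; illustrative only, not the posted patch.
// 'out' must have room for length + length / 127 + 1 bytes in the worst case.
ULONG rle_pack(const UCHAR* in, ULONG length, UCHAR* out)
{
    UCHAR* p = out;
    ULONG  i = 0;

    while (i < length)
    {
        // Measure the run of identical bytes starting at position i.
        ULONG run = 1;
        while (i + run < length && in[i + run] == in[i] && run < 127)
            ++run;

        if (run >= 3)
        {
            *p++ = (UCHAR) (-(SCHAR) run);   // negative count: repeat next byte
            *p++ = in[i];
            i += run;
        }
        else
        {
            // Extend a literal run until a compressible repeat shows up.
            ULONG lit = run;
            while (i + lit < length && lit < 127 &&
                   !(i + lit + 2 < length &&
                     in[i + lit] == in[i + lit + 1] &&
                     in[i + lit] == in[i + lit + 2]))
                ++lit;

            *p++ = (UCHAR) lit;              // positive count: literal bytes
            memcpy(p, in + i, lit);          // the single memcpy per literal run
            p += lit;
            i += lit;
        }
    }
    return (ULONG) (p - out);
}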