I regret that I have neither a copy of the Firebird source on the boat nor
access to adequate bandwidth to get it, so I'm not in a position to comment
on the existing code one way or another.  But as I understand your
proposal, you are suggesting that the ODS be changed to save (at most) one
byte per 4,050 bytes (approximately) of very large fragmented record.  That
isn't much of a payback.

But looking at your code below, it would be much faster if you just
declared your variables as int and got rid of the casts.  All the casts are
doing for you is forcing the compiler to explicitly truncate the results to
16 bits, which is not necessary.

I am aware that it is stylish to throw in as many casts and consts as
possible, but simple type safety is both faster and more readable.
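
For example, the residual-count branch of the loop quoted below could be
written with plain ints, so nothing has to be truncated to 16 bits along the
way.  This is only a sketch to show the idea -- the function name and the
little test driver are mine, not a drop-in patch for dpm.epp:

#include <algorithm>
#include <cstdio>

// Sketch only: the residual-count branch of the loop quoted below, written
// with int variables so the compiler never truncates intermediate results.
static void move_run(const unsigned char*& in, unsigned char*& out,
                     int& count, int& length)
{
    const int l = std::min(count, length - 1);  // bytes to move this pass
    for (int n = l; n; --n)
        *--out = *--in;                         // copy backward, as in dpm.epp
    *--out = static_cast<unsigned char>(l);     // the byte store is the only narrowing left
    length -= l + 1;                            // bytes remaining on page
    count -= l;                                 // bytes remaining in run
}

int main()
{
    unsigned char src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    unsigned char dst[16] = {};
    const unsigned char* in = src + 8;          // source end
    unsigned char* out = dst + 16;              // destination end
    int count = 8, length = 10;

    move_run(in, out, count, length);
    std::printf("count=%d length=%d control=%d\n", count, length, *out);
    return 0;
}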

I don't mean to dump on your proposal, but if you're going to make a
change, make a change worth doing.  I'm not insisting that Firebird adopt
value-based encoding, as that is a choice for the guys doing the
implementing.  I did make the change from run length encoding to value-based
encoding in Netfrastructure and found it reduced on-disk record sizes
by 2/3.
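
To give a flavor of what value-based encoding means -- and this is only an
illustrative sketch, not the actual Netfrastructure format -- each value is
written with a self-describing tag byte, so small integers cost a byte and
larger values cost only as many bytes as they actually need:

#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative sketch of self-describing, value-based encoding.
// The tag values and length classes here are made up for the example.
enum Tag : uint8_t {
    TAG_NULL      = 0x00,
    TAG_SMALL_INT = 0x20,   // values 0..31 folded into the tag itself
    TAG_INT8      = 0x40,   // tag + 1 byte
    TAG_INT32     = 0x41    // tag + 4 bytes
};

static void encodeInt(std::vector<uint8_t>& out, int32_t value)
{
    if (value >= 0 && value < 32)
        out.push_back(TAG_SMALL_INT | static_cast<uint8_t>(value));
    else if (value >= -128 && value < 128) {
        out.push_back(TAG_INT8);
        out.push_back(static_cast<uint8_t>(value));
    } else {
        out.push_back(TAG_INT32);
        for (int shift = 0; shift < 32; shift += 8)
            out.push_back(static_cast<uint8_t>(value >> shift));
    }
}

int main()
{
    std::vector<uint8_t> record;
    for (int32_t v : {0, 7, 100, -5, 1000000})
        encodeInt(record, v);
    std::printf("5 integers encoded in %zu bytes\n", record.size());
    return 0;
}

The record describes itself, so there is no separate control stream to
manage, and the density comes from values shrinking to what they need rather
than from finding runs of identical bytes.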

And, incidentally, the existing code that you deride as a hack is probably
also my code, though probably reworked by a half dozen folks over the years.
Still, I would prefer the term "archaic historical artifact" to "hack" as
it was written on a 1 MB Apollo DN 330 running a 68010, approximately the
norm for workstations circa 1984.  Machines have changed since then, and
with them, the tradeoffs.

On Friday, February 27, 2015, Slavomir Skopalik <skopa...@elektlabs.cz>
wrote:

>  Hi Jim,
> I did not call your scheme a hack; that is a misunderstanding.
> I said that the current implementation of RLE in Firebird is a hack
> (it parses the RLE control stream outside the compressor/decompressor,
> in reverse order).
>
> If I replace the current RLE with anything else, I have to repeat the
> same or worse hacks, and I don't want to go that way (wasting time on a
> bad implementation).  Please look at the code first.
>
>
> http://sourceforge.net/p/firebird/code/HEAD/tree/firebird/branches/B2_5_Release/src/jrd/dpm.epp
>
> // Move compressed data onto page
>
>	while (length > 1)
>	{
>		// Handle residual count, if any
>		if (count > 0)
>		{
>			const USHORT l = MIN((USHORT) count, length - 1);
>			USHORT n = l;
>			do {
>				*--out = *--in;
>			} while (--n);
>			*--out = l;
>			length -= (SSHORT) (l + 1);	// bytes remaining on page
>			count -= (SSHORT) l;		// bytes remaining in run
>			continue;
>		}
>
>		if ((count = *--control) < 0)
>		{
>			*--out = in[-1];
>			*--out = count;
>			in += count;
>			length -= 2;
>		}
>	}
>
>
> As I wrote, it is impossible to change the encoding without refactoring
> the current code base.
>
> Slavek
>
> Ing. Slavomir Skopalik
> Executive Head
> Elekt Labs s.r.o.
> Collection and evaluation of data from machines and laboratories
> by means of system MASA (http://www.elektlabs.cz/m2demo)
> -----------------------------------------------------------------
> Address:
> Elekt Labs s.r.o.
> Chaloupky 158
> 783 72 Velky Tynec
> Czech Republic
> ---------------------------------------------------------------
> Mobile: +420 724 207 851
> icq: 199 118 333  skype: skopaliks  e-mail: skopa...@elektlabs.cz
> http://www.elektlabs.cz
>
> On 28.2.2015 1:12, James Starkey wrote:
>
> First, I take personal offense at your characterization of my encoding
> scheme as a hack.  It is not.  It is a carefully thought out scheme with
> multiple implementations in three database systems.  It has been
> measured, compared, and extensively profiled.  I would be the last to cram
> it down someone's throat, but it is not a hack and I resent it being
> referred to as such.
>
> Secondly, the tests of an encoding scheme are density, cost of encoding,
> and cost of decoding.  Your personal estimate of implementation cost
> doesn't enter the equation.
>
> What you consider normal for a Z80 doesn't carry all that much weight, at
> least to me.
>
>
> On Friday, February 27, 2015, Slavomir Skopalik <skopa...@elektlabs.cz>
> wrote:
>
>
>   Hi Jim,
> I will try to explain.
>
> First, for any encoding scheme, we need a good interface that is respected
> by all other parts of the program.
> Right now the core of the RLE is in one file, but some other parts of
> Firebird try to parse the RLE directly.
> In this situation I need to clean up the code to use the interface.
> To see what is not really correct, look here:
> dpm.cpp - static void store_big_record(thread_db* tdbb, record_param* rpb,
>                              PageStack& stack,
>                              DataComprControl* dcc, ULONG size)
>
> Second is the encoding itself.
> I agree that your scheme is better.
> But currently it is impossible to integrate into Firebird because of the
> lack of an interface.
> I will not replace one hack with another hack.
> Also, the same encoding scheme could be used for the backup and the wire
> protocol.
>
> Third:
> Why I disagree with the current generation of the control character stream.
> The current code (FB 2.5.3) allocates one half of the record length for the
> control stream; my RLE needs 66% of the record length for the control
> stream.
> That means you have already allocated a buffer of roughly the same size as
> the record.
> But instead of just copying, you rescan and reallocate to get data that you
> could already have.
> For CPU and HDD it is worse; for RAM it is a little better (at most 32 KB
> saved during writing).
> I don't see any real benefit.
>
> Conclusion:
> Is it possible to change the mechanism from a control character stream to a
> packed stream (and to create a new interface for the encoder/decoder)?
> If yes, how can I help?
> If not, can hacks like the one in store_big_record be moved into SQZ?
>
> Historical note: I designed my RLE for the Zilog Z80 CPU on the ZX Spectrum
> in the '80s.
> It normally operates in the same buffer during compression/decompression.
>
> Is that clear?
>
> Slavek
>
>
> On 27.2.2015 19:14, James Starkey wrote:
>
> Perhaps a smarter approach would be to capture the run lengths on the first
> scan to drive the encoding.  I vaguely remember that the code once did
> something like that.
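>
> Something along these lines, just as a sketch -- the names and the run
> threshold are mine, not the actual SQZ code:
>
> #include <cstdio>
> #include <vector>
>
> // Sketch of a two-pass approach: the first pass only records run lengths,
> // the second pass uses them to drive the encoding without rescanning.
> struct Run { bool repeat; int length; };
>
> // Pass 1: capture the runs.
> static std::vector<Run> scanRuns(const unsigned char* data, int length)
> {
>     std::vector<Run> runs;
>     for (int i = 0; i < length; ) {
>         int j = i + 1;
>         while (j < length && data[j] == data[i])
>             ++j;
>         const int n = j - i;
>         if (n >= 3)
>             runs.push_back({true, n});      // long enough for a repeat run
>         else if (!runs.empty() && !runs.back().repeat)
>             runs.back().length += n;        // extend the current literal run
>         else
>             runs.push_back({false, n});
>         i = j;
>     }
>     return runs;
> }
>
> // Pass 2: the captured runs drive the emission (here just printed).
> int main()
> {
>     const unsigned char rec[] = "ABBBBBBBCD";
>     for (const Run& r : scanRuns(rec, sizeof(rec) - 1))
>         std::printf("%s run of %d bytes\n",
>                     r.repeat ? "repeat" : "literal", r.length);
>     return 0;
> }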
>
> Could you describe your scheme and explain why it's better?  Run length
> encoding doesn't seem to lend itself to a lot of optimizations.  It's
> actually a bad scheme that just happened to be better than the alternatives
> (then available).
>
> Historical note:  The DEC JRD was part of disk engineering's database
> machine program.  The group manager was somewhat upset that we were doing
> data compression at all -- DEC, after all, sold disk drives.  I explained
> that it was really an important performance optimization to minimize disk
> reads and writes, which seemed to mollify him.  Before that, it just
> wasn't anything that database systems did.
>
> On Friday, February 27, 2015, Slavomir Skopalik <skopa...@elektlabs.cz>
> wrote:
>
>
>   Hi Jim,
> this is what happens in the current Firebird if a record does not fit in
> the buffer:
>
> 1. Scan and calculate the compressed length.
> 2. If it does not fit, scan the control buffer and calculate how many bytes
> will fit, plus padding.
> 3. Compress into the small area (scan again).
> 4. Find more free space on a data page and go to 1 with the unprocessed
> part of the record.
>
> I'm not sure that this is faster than compressing into a buffer on the
> stack and doing a few moves.
>
> Why RLE now?  Because I already have it, and I only started with the FB
> sources two weeks ago.
> It was easy to adapt my RLE, but it was hard to understand the padding.
>
> Now I would like to look into record encoding like you describe, but to be
> able to do that, I have to understand why it is designed the way it is.
>
> And another point of view:
> the cost of the changes was small and the impact on size and speed was
> high -- that's why I did it.
>
> Your proposal needs much more work.
> From my point of view, it isn't realistic to do it for FB 2.5.x or FB 3.
> When the encoding is implemented, it will be nice to use it also for the
> backup and the wire protocol.
>
> Thank you.
>
> Slavek
>
>
> On 27.2.2015 16:40, James Starkey wrote:
>
> The answer to your questions is simple:  It is much faster to encode from
> the original record onto the data page(s), eliminating the need to
> allocate, populate, copy, and release a temporary buffer.
>
> And, frankly, the cost of a byte per full database page is not something to
> lose sleep over.
>
> The competition for a different compression scheme isn't the 30-year-old
> run length encoding but the self-describing, value-driven encoding I
> described earlier.
>
> Another area where there is much room for improvement is the encoding of
> multi-column indexes.  There is a much more clever scheme that doesn't
> waste every fifth byte.
>
> On Friday, February 27, 2015, Slavomir Skopalik <skopa...@elektlabs.cz>
> wrote:
>
>
>  Hi Vlad,
> as I see it, in some situations (which really happen), packing into the
> small area is padded with zeroes
> (an uncompressed prefix with zero length),
> and a new control character is added at the beginning of the next fragment
> (you lose 2 bytes).
> The difference with the current compression is not large, but with a better
> one it is more significant.
>
> Finally, I still do not understand why it is better to compress each
> fragment separately instead of
> making one compressed block that is then split into fragments.
>
> If we had a routine to compress/encode the full record, we could easily
> replace the current RLE
> with any other encoding scheme.
>
> In the current situation it is not easy to replace the current RLE with a
> value encoding scheme.
>
> I have finished a new RLE that is about 25% more effective than my previous
> post, but I am losing a lot of bytes on padding and new headers (and also
> 1 byte per row to keep compatibility with the previous DB).
>
> I will clean up the code and post it here within a few days.
>
> The record difference encoding can also be improved; I will do it if
> somebody needs it.
>
> About updates, I worry that fragmented records will not add a performance
> gain during updates.
>
> Slavek
>
>
>      Not exactly so.  The big record is prepared for compression as a
> whole, then the tail of the record is packed and put on separate page(s),
> and finally what is left (and can be put on a single page) is really
> "re-compressed" separately.
>
>  And when the record is materialized in RAM, all parts are read and
>  decompressed separately.
>
>      What problem do you see here?  How else do you propose to decompress a
> fragmented record?
>
>  If the compressor cannot fit into the small space, then the rest of the
>  space is padded (char 0x0 is used).
>
>      The record image in memory always has a fixed length, according to the
> record format.
>
>  This wastes CPU and disk space.
>
>      CPU - yes, memory - yes, disk - no.
>
>      Also, note that it allows us later to not waste CPU when fields are
> accessed and the record is updated, AFAIU.
>
> Regards,
> Vlad
>

-- 
Jim Starkey
Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
