Or you could just put it on ZFS and let it do all of the difficult stuff for you. :-)
On Thu, Aug 24, 2023 at 6:40 PM pslawek83 <[email protected]> wrote:
>
> Hi Marko, thanks for the response... So, for the docs part:
>
> > https://mariadb.com/kb/en/innodb-page-compression/
> > When InnoDB page compression is used, InnoDB may still write the compressed page to the tablespace file with the original size of the uncompressed page, which would be equivalent to the value of the innodb_page_size system variable. This is done by design, because when InnoDB's I/O code needs to read the page from disk, it can only read the full page size. However, this is obviously not optimal.
> > This punch hole technique allows InnoDB to read the compressed page from disk as the full page size, even though the compressed page really takes up less space on the file system.
>
> So, as far as I understand, if compression is enabled the server will write e.g. a 40%-utilized page, which will later be decompressed when read, but this compression/decompression won't have any advantage if there's no punch-hole support. The only difference will be that the block is filled with zeros in the empty space at the end. Or is it possible for PAGE_COMPRESSION to still be used without punch hole (just not as effective, for example)?
>
> I also tested row compression, and yes, page compression seems much better. Is it possible to get rid of punch hole by just re-mapping the pages inside these InnoDB files? I was thinking about this... It might be possible to change the data structure to divide the file into 512-byte blocks, where the beginning of each block marks the page number (0xffffffff for blocks that are a continuation of the previous page). To increase a page's size we'd need to add another block with the same page number at the end of the file; later, contiguous recycled blocks could be added to a list and used to store smaller pages. To make this atomic, the whole block would first be allocated and written with 0xfffffffe as a dummy id, and then this placeholder id could be replaced with the real block number.
>
> The problem with this could be that the file would get 100% fragmented over time (but it would probably still work fine on NVMe). The mapping could be rebuilt by reading the whole data file during server startup, and it would need about 125 MB of memory per 1 TB of compressed data to keep the mapping in memory for fast access.
>
> What do you think? Would it be easy to add on top of the current code (if there's an API, maybe, so that for example I could do this myself), and would it even make sense?
>
> Best,
> Slawomir.
>
> On 21 August 2023 at 07:08 Marko Mäkelä <[email protected]> wrote:
>
> > On Sun, Aug 20, 2023 at 6:12 PM pslawek83 via discuss <[email protected]> wrote:
> >
> > > Thanks everyone for the answers. I tested PAGE_COMPRESSION a bit and did some reading. I'm not sure if I fully understand how it works... Is the following correct?
> > >
> > > 1. It can't work without punch hole, because if we don't have it the server will just write partially used pages which take up the full page size anyway. But the docs seem to describe its use with punch hole disabled. What's the reason to still have page compression without punch hole?
> >
> > Can you please point to the misleading part of the documentation so that it can be corrected?
> >
> > > 2. After enabling compression, each page needs to be trimmed using punch hole, so if we have a 1 TB database and 64k blocks, the database files will have over 15 million fragments (1000000000000/64000).
> >
> > That is correct.
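
For readers unfamiliar with it: the "punch hole" being discussed is, on Linux, the fallocate(2) call with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, which lets InnoDB page compression deallocate the unused tail of a page while the file's apparent size stays unchanged. A rough sketch of the idea follows; it is not the actual InnoDB code, and the file name, page size, filesystem block size and compressed length are made-up examples.

    /* Sketch of the "punch hole" idea behind InnoDB page compression.
     * Not the actual InnoDB code; all names and sizes are illustrative.
     * Linux-specific: fallocate(2) with FALLOC_FL_PUNCH_HOLE. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t page_size      = 16384; /* innodb_page_size */
        const size_t fs_block       = 4096;  /* filesystem block size */
        const size_t compressed_len = 6000;  /* pretend the page compressed to this */
        const off_t  page_offset    = 0;     /* where the page lives in the file */

        int fd = open("demo.ibd", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Write the full 16 KiB page: compressed payload followed by zero padding. */
        unsigned char *page = calloc(1, page_size);
        if (!page) return 1;
        memset(page, 'x', compressed_len);   /* stand-in for the compressed bytes */
        if (pwrite(fd, page, page_size, page_offset) != (ssize_t) page_size) {
            perror("pwrite"); return 1;
        }

        /* Deallocate the unused tail. Only whole filesystem blocks can be
         * released, so round the start of the hole up to the next fs block. */
        off_t hole_start = page_offset
            + (off_t) ((compressed_len + fs_block - 1) / fs_block) * fs_block;
        off_t hole_len = page_offset + (off_t) page_size - hole_start;
        if (hole_len > 0
            && fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         hole_start, hole_len) != 0) {
            perror("fallocate");  /* e.g. EOPNOTSUPP on filesystems without hole punching */
        }

        /* The file still reports page_size bytes (st_size), but `du` or
         * stat.st_blocks will show that less space is actually allocated. */
        free(page);
        close(fd);
        return 0;
    }

Because holes can only be punched in whole filesystem blocks, each compressed page saves space only in multiples of the filesystem block size, and every punched page can become a separate extent, which is where the millions of file fragments discussed in this thread come from.
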
> > > So if we get, say, 50% compression and we're then able to put another 1 TB of data into these holes, we'll have close to 31 million file fragments. I'm not sure whether under such conditions the FS will even be able to allocate space efficiently anymore, copy these files, etc. Or am I missing something?
> >
> > For what it is worth, around the same time that page_compressed was being developed for MariaDB Server 10.1, something similar was being worked on at Oracle, I think for MySQL 5.7. To my understanding, the same hardware partner worked with both companies. The obvious answer to https://smalldatum.blogspot.com/2015/10/wanted-file-system-on-which-innodb.html would have been https://en.wikipedia.org/wiki/Fusion-io_NVMFS if it had worked without kernel panics or existed in a mainline kernel. I don't even know if it was ever open sourced.
> >
> > > 3. I was testing compression of a ~1.5 GB table with nice results on the default 16 KB page size. One thing I noticed is that I was always getting exactly the same compressed size according to `du`. However, the table read/copy times varied from 40 s uncompressed to 80-400 s depending on the algorithm. Is it possible that this is because the page size is too small?
> >
> > As far as I remember, most compression algorithms that are derivatives of the LZ77 family have an input buffer size of 32 KiB. From that point of view, it would seem to make sense to use innodb_page_size=32k or innodb_page_size=64k instead of the default. Furthermore, I think that it was a mistake to implement bzip2 support at all. That algorithm would work with much larger buffer sizes.
> >
> > > 4. MyRocks seems to be very slow compared to compressed InnoDB. I was also reading the docs related to it. For its use case, is it more suitable for storing large rows rather than many small rows?
> >
> > To my understanding, the LSM trees that MyRocks is built on work best in an insert-only workload. On top of LSM trees, MyRocks implements some caches and Bloom filters. If you have few updates or deletes and most reads access recently inserted data, it could work.
> >
> > > 5. I remember reading that some InnoDB compression is going to become obsolete. Is that row compression? Or are there no plans to obsolete any compression model?
> >
> > I wanted to deprecate and remove the ROW_FORMAT=COMPRESSED format that I designed and implemented in the InnoDB Plugin for MySQL 5.1, based on some high-level ideas of Heikki Tuuri, between 2005 and 2007. https://jira.mariadb.org/browse/MDEV-22367 was the first step towards that. Based on community feedback, this won't be happening. Even after some data structure cleanup that took place in MariaDB Server 10.6, supporting the format incurs an overhead of some 32 to 40 bytes per buffer page descriptor, which is not insignificant. Removing both ROW_FORMAT=COMPRESSED and the adaptive hash index would shrink the block descriptor to 80 bytes from its current 10.6 size of 160 bytes. The write performance of ROW_FORMAT=COMPRESSED should have degraded when I implemented the easier-to-parse log record format in MDEV-12353 (MariaDB Server 10.5). There are open bugs about ROW_FORMAT=COMPRESSED, such as https://jira.mariadb.org/browse/MDEV-31574. I would say that InnoDB ROW_FORMAT=COMPRESSED seemed like a good idea back when it was implemented, when HDD storage was the only practical option.
> > Just like the InnoDB change buffer, which I removed in MariaDB Server 11.0 after having to deal with too many cases of corruption: https://fosdem.org/2023/schedule/event/innodb_change_buffer/
> >
> > With best regards,
> > Marko
> >
> > --
> > Marko Mäkelä, Lead Developer InnoDB
> > MariaDB plc
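
A side note on the `du` observation in point 3 above: punching holes reduces the space allocated to a file but not its apparent size, so the savings show up in stat(2)'s st_blocks (which is what `du` counts) rather than in st_size. A small sketch for checking this on a given tablespace file; the fallback path used here is only an illustrative example.

    /* Compare a file's apparent size with the space actually allocated on disk,
     * e.g. for an InnoDB tablespace using PAGE_COMPRESSED with hole punching.
     * The fallback path below is only an illustrative example. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "test/t1.ibd";
        struct stat st;

        if (stat(path, &st) != 0) {
            perror(path);
            return 1;
        }

        long long apparent  = (long long) st.st_size;          /* what ls -l reports */
        long long allocated = (long long) st.st_blocks * 512;  /* what du counts */

        printf("%s: apparent %lld bytes, allocated %lld bytes (%.1f%% allocated)\n",
               path, apparent, allocated,
               apparent ? 100.0 * allocated / apparent : 0.0);
        return 0;
    }

On Linux, running filefrag on the same file reports how many extents it consists of, which gives a rough idea of the fragmentation discussed earlier in the thread.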
