On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagra...@pivotal.io> wrote: > > Heikki and I have been hacking recently for few weeks to implement > in-core columnar storage for PostgreSQL. Here's the design and initial > implementation of Zedstore, compressed in-core columnar storage (table > access method). Attaching the patch and link to github branch [1] to > follow along. > > The objective is to gather feedback on design and approach to the > same. The implementation has core basic pieces working but not close > to complete. > > Big thank you to Andres, Haribabu and team for the table access method > API's. Leveraged the API's for implementing zedstore, and proves API > to be in very good shape. Had to enhance the same minimally but > in-general didn't had to touch executor much. > > Motivations / Objectives > > * Performance improvement for queries selecting subset of columns > (reduced IO). > * Reduced on-disk footprint compared to heap table. Shorter tuple > headers and also leveraging compression of similar type data > * Be first-class citizen in the Postgres architecture (tables data can > just independently live in columnar storage) > * Fully MVCC compliant > * All Indexes supported > * Hybrid row-column store, where some columns are stored together, and > others separately. Provide flexibility of granularity on how to > divide the columns. Columns accessed together can be stored > together. > * Provide better control over bloat (similar to zheap) > * Eliminate need for separate toast tables > * Faster add / drop column or changing data type of column by avoiding > full rewrite of the table. > > High-level Design - B-trees for the win! > ======================================== > > To start simple, let's ignore column store aspect for a moment and > consider it as compressed row store. The column store is natural > extension of this concept, explained in next section. > > The basic on-disk data structure leveraged is a B-tree, indexed by > TID. BTree being a great data structure, fast and versatile. Note this > is not referring to existing Btree indexes, but instead net new > separate BTree for table data storage. > > TID - logical row identifier: > TID is just a 48-bit row identifier. The traditional division into > block and offset numbers is meaningless. In order to find a tuple with > a given TID, one must always descend the B-tree. Having logical TID > provides flexibility to move the tuples around different pages on page > splits or page merges can be performed. > > The internal pages of the B-tree are super simple and boring. Each > internal page just stores an array of TID and downlink pairs. Let's > focus on the leaf level. Leaf blocks have short uncompressed header, > followed by btree items. Two kinds of items exist: > > - plain item, holds one tuple or one datum, uncompressed payload > - a "container item", holds multiple plain items, compressed payload > > +----------------------------- > | Fixed-size page header: > | > | LSN > | TID low and hi key (for Lehman & Yao B-tree operations) > | left and right page pointers > | > | Items: > | > | TID | size | flags | uncompressed size | lastTID | payload (container > item) > | TID | size | flags | uncompressed size | lastTID | payload (container > item) > | TID | size | flags | undo pointer | payload (plain item) > | TID | size | flags | undo pointer | payload (plain item) > | ... > | > +---------------------------- > > Row store > --------- > > The tuples are stored one after another, sorted by TID. For each > tuple, we store its 48-bit TID, a undo record pointer, and the actual > tuple data uncompressed. >
Storing undo record pointer with each tuple can take quite a lot of space in cases where you can't compress them. Have you thought how will you implement the multi-locker scheme in this design? In zheap, we have used undo for the same and it is easy to imagine when you have separate transaction slots for each transaction. I am not sure how will you implement the same here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com