On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagra...@pivotal.io> wrote:

> Heikki and I have been hacking recently for a few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, a compressed in-core columnar storage (table
> access method). Attaching the patch and a link to the github branch [1]
> to follow along.
>
> The objective is to gather feedback on the design and approach. The
> implementation has the core basic pieces working, but it is not close
> to complete.
>
> A big thank you to Andres, Haribabu and team for the table access method
> APIs. We leveraged the APIs to implement zedstore, and they proved to be
> in very good shape. We had to enhance them only minimally and in general
> didn't have to touch the executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting a subset of columns
>   (reduced IO).
> * Reduced on-disk footprint compared to heap tables. Shorter tuple
>   headers and also leveraging compression of similar-type data.
> * Be a first-class citizen in the Postgres architecture (table data can
>   just independently live in columnar storage)
> * Fully MVCC compliant
> * All Indexes supported
> * Hybrid row-column store, where some columns are stored together, and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored
>   together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate need for separate toast tables
> * Faster add / drop column or changing data type of column by avoiding
>   full rewrite of the table.
>
> High-level Design - B-trees for the win!
> ========================================
>
> To start simple, let's ignore the column store aspect for a moment and
> consider this as a compressed row store. The column store is a natural
> extension of this concept, explained in the next section.
>
> The basic on-disk data structure leveraged is a B-tree, indexed by
> TID. The B-tree is a great data structure: fast and versatile. Note this
> is not referring to the existing B-tree indexes, but to a net new,
> separate B-tree used for the table data storage itself.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. Having a logical TID
> provides the flexibility to move tuples between pages when page splits
> or page merges are performed.
>
In my understanding, these TIDs will follow the datatype of the current
ones. Then my question is: will TIDs be reusable here, and how will the
reusable range of TIDs be determined? If not, wouldn't that become a hard
limit on the number of insertions performed on a table?
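Purely as a back-of-the-envelope illustration of that limit, here is how I
picture the 48-bit logical TID in C; the zstid name and the packing are my
own placeholders, not taken from the patch:

    /* Illustrative sketch only: zstid and ZSTID_MAX are placeholder names. */
    #include <stdint.h>

    typedef uint64_t zstid;     /* logical row id; only the low 48 bits are used */

    #define ZSTID_BITS 48
    #define ZSTID_MAX  ((((zstid) 1) << ZSTID_BITS) - 1)    /* ~2.8 * 10^14 rows */

    /* No block/offset split is hidden in here: finding the tuple for a given
     * zstid always means descending the table's B-tree keyed by this value. */

If deleted TIDs are never recycled, ZSTID_MAX would indeed become a lifetime
cap on inserts per table.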

> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have a short uncompressed header,
> followed by B-tree items. Two kinds of items exist:
>
>  - plain item, holds one tuple or one datum, uncompressed payload
>  - a "container item", holds multiple plain items, compressed payload
>
> +-----------------------------
> | Fixed-size page header:
> |
> |   LSN
> |   TID low and hi key (for Lehman & Yao B-tree operations)
> |   left and right page pointers
> |
> | Items:
> |
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   ...
> |
> +----------------------------
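Reading the item layout above, I imagine the two item kinds roughly as in
this C sketch; the struct and field names are mine and the field widths are
guesses, only meant to make the diagram concrete:

    /* Rough, hypothetical rendering of the two leaf-item kinds. */
    #include <stdint.h>

    typedef struct
    {
        uint64_t    tid;        /* 48-bit logical TID of this datum/tuple */
        uint16_t    size;       /* total item size */
        uint16_t    flags;
        uint64_t    undo_ptr;   /* undo record pointer, for MVCC */
        /* uncompressed payload (one tuple or one datum) follows */
    } ZSPlainItemSketch;

    typedef struct
    {
        uint64_t    tid;                /* first TID covered by this container */
        uint16_t    size;
        uint16_t    flags;
        uint32_t    uncompressed_size;  /* payload size before compression */
        uint64_t    last_tid;           /* last TID covered by this container */
        /* compressed payload holding a run of plain items follows */
    } ZSContainerItemSketch;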
>
> Row store
> ---------
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, an undo record pointer, and the actual
> tuple data, uncompressed.
>
> In uncompressed form, the page can be arbitrarily large. But after
> compression, it must fit into a physical 8k block. If, on insert or
> update of a tuple, the page can no longer be compressed below 8k, the
> page is split. Note that because TIDs are logical rather than physical
> identifiers, we can freely move tuples from one physical page to
> another during page split. A tuple's TID never changes.
>
> The buffer cache caches compressed blocks. Likewise, WAL-logging,
> full-page images etc. work on compressed blocks. Decompression is done
> on the fly, in backend-private memory, as and when needed while
> reading. For some compression schemes, like run-length or delta
> encoding, tuples can be constructed directly from the compressed data.
>
> Column store
> ------------
>
> A column store uses the same structure but we have *multiple* B-trees,
> one for each column, all indexed by TID. The B-trees for all columns
> are stored in the same physical file.
>
> A metapage at block 0 has links to the roots of the B-trees. Leaf
> pages look the same, but instead of storing the whole tuple, each
> stores just a single attribute. To reconstruct a row with a given TID,
> a scan descends the B-trees of all the columns using that TID and
> fetches all the attributes. Likewise, a sequential scan walks all the
> B-trees in lockstep.
>
> So, in summary, one can imagine Zedstore as a forest of B-trees, one
> for each column, all indexed by TID.
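In other words, fetching one row by TID would look something like the sketch
below; zs_btree_lookup() is a made-up helper standing in for a per-column
B-tree descent, not a function from the patch:

    /* Sketch only: zs_btree_lookup() is hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t  zstid;
    typedef uintptr_t Datum;

    /* Hypothetical: descend the B-tree of attribute 'attno' down to 'tid'. */
    extern Datum zs_btree_lookup(void *rel, int attno, zstid tid, bool *isnull);

    static void
    fetch_row(void *rel, zstid tid, int natts, Datum *values, bool *isnull)
    {
        for (int attno = 0; attno < natts; attno++)
            values[attno] = zs_btree_lookup(rel, attno, tid, &isnull[attno]);
    }

A sequential scan would instead advance all the per-column trees in lockstep
over an ascending TID stream, rather than doing one descent per attribute per
row.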
>
> This way of laying out the data also easily allows for a hybrid
> row-column store, where some columns are stored together, and others
> have a dedicated B-tree. User-facing syntax is needed to allow
> specifying how to group the columns.
>
>
> Main reasons for storing data this way
> --------------------------------------
>
> * Lay out the data/tuples in a mapped fashion instead of keeping the
>   logical-to-physical mapping separate from the actual data. That is,
>   keep the metadata and data logically in a single file stream, avoiding
>   the need for separate forks/files to store metadata and data.
>
> * Stick to fixed-size physical blocks. Variable-size blocks require
>   more logical-to-physical mapping maintenance, plus they impose
>   restrictions on concurrency of writes and reads to files. Hence we
>   adapt compression to fit fixed-size blocks instead of the other way
>   round.
>
>
> MVCC
> ----
> MVCC in zedstore works very similarly to zheap. Undo record pointers
> are used to implement MVCC; transaction information is not directly
> stored with the data. In zheap, there's a small, fixed number of
> "transaction slots" on each page, but zedstore has an undo pointer with
> each item directly; in normal cases, the compression squeezes this
> down to almost nothing.
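My mental model of the per-item undo pointer, loosely following zheap's
scheme, is the sketch below; it is only a sketch, and none of these names
exist in the patch:

    /* Hypothetical sketch: visibility via a per-item undo pointer. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t ZSUndoPtr;

    /* Hypothetical helpers. */
    extern bool undo_ptr_is_discarded(ZSUndoPtr p);   /* undo already trimmed away */
    extern bool undo_record_satisfies(ZSUndoPtr p, void *snapshot);

    static bool
    item_is_visible(ZSUndoPtr undo_ptr, void *snapshot)
    {
        if (undo_ptr_is_discarded(undo_ptr))
            return true;    /* creating transaction is old and committed */
        return undo_record_satisfies(undo_ptr, snapshot);  /* consult undo log */
    }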
>
How about using a separate B-tree for undo also?

> Implementation
> ==============
>
> Insert:
> Inserting a new row splits the row into datums. Then, for the first
> column, we decide which block to insert it into, pick a TID for it, and
> write an undo record for it. The rest of the columns are inserted using
> that same TID and point to the same undo position.
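So per-row insertion, as I understand it, goes roughly like this sketch;
every helper below is a hypothetical placeholder, not the patch's API:

    /* Sketch only: all functions are hypothetical placeholders. */
    #include <stdint.h>

    typedef uint64_t  zstid;
    typedef uint64_t  ZSUndoPtr;
    typedef uintptr_t Datum;

    extern ZSUndoPtr zs_write_insert_undo(void *rel);
    extern zstid     zs_insert_first_column(void *rel, Datum d, ZSUndoPtr undo);
    extern void      zs_insert_column(void *rel, int attno, Datum d,
                                      zstid tid, ZSUndoPtr undo);

    static void
    insert_row(void *rel, int natts, Datum *values)
    {
        ZSUndoPtr undo = zs_write_insert_undo(rel);     /* one undo record per row */
        /* The first column's B-tree decides the block and assigns the TID... */
        zstid tid = zs_insert_first_column(rel, values[0], undo);
        /* ...and the remaining columns reuse that TID and undo position. */
        for (int attno = 1; attno < natts; attno++)
            zs_insert_column(rel, attno, values[attno], tid, undo);
    }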
>
> Compression:
> Items are added to the B-tree in uncompressed form. If a page is full
> and a new item can't be added, compression kicks in. The existing
> uncompressed items (plain items) of the page are passed to the
> compressor; already compressed items are added back as is. The page is
> rewritten with the compressed data and the new item added to it. If the
> item can't be added to the page even after compression, the page is
> split.
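The page-fill rule described here could be summarized as follows; this is a
pseudocode-style sketch, and the helper names are invented:

    /* Sketch of the "compress to fit, otherwise split" rule; helpers are invented. */
    #include <stdbool.h>

    extern bool zs_add_uncompressed_item(void *page, void *item);
    extern bool zs_recompress_page(void *page, void *newitem);  /* plain items become
                                                                 * container items */
    extern void zs_split_page(void *rel, void *page, void *newitem);

    static void
    zs_add_item(void *rel, void *page, void *item)
    {
        if (zs_add_uncompressed_item(page, item))
            return;                      /* fits as a plain (uncompressed) item */
        if (zs_recompress_page(page, item))
            return;                      /* existing plain items were compressed;
                                          * already-compressed items kept as is */
        zs_split_page(rel, page, item);  /* still over 8k after compression: split */
    }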
>
> Toast:
> When an overly large datum is stored, it is divided into chunks, and
> each chunk is stored on a dedicated toast page within the same
> physical file. The toast pages of a datum form a list; each page has a
> next/prev pointer.
>
> Select:
> A property is added to the table AM API to convey whether the AM
> leverages column projection for scans. While scanning tables whose AM
> advertises this property, the executor parses the plan, leveraging the
> target list and quals to find the columns required by the query. This
> list is passed down to the AM on beginscan. Zedstore uses this column
> projection list to pull data only from the selected columns. A virtual
> tuple table slot is used to pass back the datums for the subset of
> columns.
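As a sketch of what that hand-off might look like; the function names and
signatures below are hypothetical, not the actual table AM API change in the
patch:

    /* Hypothetical sketch of passing a column projection list down to the AM. */

    /* Executor side: collect the attribute numbers referenced by the target
     * list and quals, e.g. just {i199} for "SELECT AVG(i199) FROM zlayout". */
    extern int   extract_needed_attnos(void *plan, int *attnos, int max_attnos);

    /* AM side: a zedstore scan then opens only those columns' B-trees. */
    extern void *zs_beginscan_with_projection(void *rel, const int *attnos,
                                              int nattnos);

    static void *
    begin_projected_scan(void *rel, void *plan)
    {
        int attnos[1600];   /* scratch; sized arbitrarily for this sketch */
        int n = extract_needed_attnos(plan, attnos, 1600);
        return zs_beginscan_with_projection(rel, attnos, n);
    }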
>
I am curious about how delete works here. Will the TID entries just be
marked deleted, as in the current heap, or will they actually be removed
and the whole B-tree restructured (if required)?
Similarly, about updates: will they be just delete+insert, or will
something cleverer happen there?
Will there be in-place updates, and in what scenarios will they be possible?
There is nothing mentioned in this direction; however, the use of undo files
assures me there must be some in-place updates somewhere.

>
> Enhancements to design:
> =======================
>
> Instead of compressing all the tuples on a page in one batch, we could
> store a small "dictionary", e.g. in the page header or meta-page, and
> use it to compress each tuple separately. That could make random reads
> and updates of individual tuples faster.
>
> When adding a column, we just need to create a new B-tree for the newly
> added column and link it to the meta-page. No existing content needs to
> be rewritten.
>
> When a column is dropped, we can scan the B-tree of that column and
> immediately mark all of its pages as free in the FSM. But we don't
> actually have to scan the leaf level: every leaf page has a downlink
> in its parent, so we can scan just the internal pages. Unless the
> column is very wide, that's only a small fraction of the data. That
> makes the space immediately reusable for new insertions, but it won't
> return the space to the operating system. In order to do that, we'd
> still need to defragment, moving pages from the end of the file closer
> to the beginning, and truncate the file.
>
> In this design, we only cache compressed pages in the buffer cache. If
> we want to cache uncompressed pages instead, or in addition to that,
> we need to invent a whole new kind of buffer cache that can deal
> with variable-size blocks.
>
> If you do a lot of updates, the file can get fragmented, with lots of
> unused space on pages. Losing the correlation between TIDs and
> physical order is also bad, because it will make SeqScans slow, as
> they're not actually doing sequential I/O anymore. We can write a
> defragmenter to fix things up: half-empty pages can be merged, and
> pages can be moved to restore the TID/physical correlation. This format
> doesn't have the same MVCC problems with moving tuples around that the
> Postgres heap does, so this can fairly easily be done on-line.
>
> Min-max values can be stored per block to easily skip scanning it if
> the column values don't fall in the requested range.
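That per-block skip test boils down to a simple range-overlap check, roughly
like this tiny sketch (the function is mine, just to illustrate):

    /* Sketch: per-block min/max lets a scan skip the block cheaply. */
    #include <stdbool.h>
    #include <stdint.h>

    static bool
    block_may_match(int64_t blk_min, int64_t blk_max,
                    int64_t qual_lo, int64_t qual_hi)
    {
        /* Scan the block only if [blk_min, blk_max] overlaps [qual_lo, qual_hi]. */
        return blk_max >= qual_lo && blk_min <= qual_hi;
    }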
>
> Notes about current patch
> =========================
>
> Basic (core) functionality is implemented to showcase and play with.
>
> Two compression algorithms are supported: Postgres pg_lzcompress and
> LZ4. Compiling the server with --with-lz4 enables LZ4 compression for
> zedstore; otherwise pg_lzcompress is the default. LZ4 is definitely
> super fast at compressing and decompressing.
>
> Not all the table AM APIs are implemented yet. Functionality that is
> not implemented will ERROR out as not supported. A zedstore table can
> be created using the command:
>
> CREATE TABLE <name> (column listing) USING zedstore;
>
> Bulk load can be performed using COPY. INSERT, SELECT, UPDATE and
> DELETE work. B-tree indexes can be created, and B-tree and bitmap index
> scans work. The test in src/test/regress/sql/zedstore.sql showcases all
> the functionality currently working. Updates are currently implemented
> as "cold", meaning they always create new items and are not performed
> in-place.
>
> TIDs currently can't leverage the full 48-bit range but instead must be
> limited to values that are considered valid ItemPointers. Also,
> MaxHeapTuplesPerPage poses restrictions on the values they can
> currently take. Refer to [7] for details.
>
> Extremely basic UNDO logging has been implemented, just from the MVCC
> perspective. MVCC is missing tuple locks right now, and no undo is
> actually performed yet. No WAL logging exists currently, hence it's not
> crash-safe either.
>
> Helper functions are provided to find how many pages of each type are
> present in a zedstore table and to find the compression ratio.
>
> For the test mentioned in the thread "Column lookup in a row
> performance" [6], a good example query for zedstore, run locally on a
> laptop using LZ4, shows:
>
> postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
>          avg
> ---------------------
>  500000.500000000000
> (1 row)
>
> Time: 4679.026 ms (00:04.679)
>
> postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
>          avg
> ---------------------
>  500000.500000000000
> (1 row)
>
> Time: 379.710 ms
>
> Important note:
> ---------------
> The planner has not been modified yet to leverage the columnar
> storage. Hence, plans using the "physical tlist" optimization or similar
> tricks that are good for a row store currently miss out on the columnar
> nature. That is why the subquery with OFFSET 0 is needed above, to
> disable the optimization and scan only the required column.
>
>
>
> The current proposal and discussion are focused on the AM layer work
> first. Hence, we currently and intentionally skip discussing planner or
> executor "feature" enhancements such as vectorized execution and that
> family of features.
>
> Previous discussions and implementations of column stores, namely the
> vertical cluster index [2], in-core columnar storage [3] and [4], and
> cstore_fdw [5], were referred to in order to distill the objectives and
> come up with a design and implementation that avoid the concerns raised
> earlier. Learnings from the Greenplum Database column store were also
> leveraged while designing and implementing this.
>
> Credit: The design is mostly the brainchild of Heikki, or actually his
> epiphany to be exact. I acted as an idea-bouncing board and contributed
> enhancements to it. We are both having a lot of fun writing the code
> for this.
>
>
> References
> 1] https://github.com/greenplum-db/postgres/tree/zedstore
> 2] https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com
> 3] https://www.postgresql.org/message-id/flat/20150611230316.GM133018%40postgresql.org
> 4] https://www.postgresql.org/message-id/flat/20150831225328.GM2912%40alvherre.pgsql
> 5] https://github.com/citusdata/cstore_fdw
> 6] https://www.postgresql.org/message-id/flat/CAOykqKfko-n5YiBJtk-ocVdp%2Bj92Apu5MJBwbGGh4awRY5NCuQ%40mail.gmail.com
> 7] https://www.postgresql.org/message-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43%40iki.fi
>
>

-- 
Regards,
Rafia Sabih
