On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagra...@pivotal.io> wrote:
> Heikki and I have been hacking for the past few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and a link to the github branch [1]
> to follow along.
>
> The objective is to gather feedback on the design and approach. The
> implementation has the core basic pieces working but is not close to
> complete.
>
> Big thank you to Andres, Haribabu and team for the table access method
> APIs. We leveraged the APIs to implement zedstore, and they proved to
> be in very good shape. We had to enhance them minimally, but in
> general didn't have to touch the executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting a subset of columns
>   (reduced IO).
> * Reduced on-disk footprint compared to heap tables: shorter tuple
>   headers, plus compression of similar-type data.
> * Be a first-class citizen in the Postgres architecture (table data
>   can just independently live in columnar storage).
> * Fully MVCC compliant.
> * All indexes supported.
> * Hybrid row-column store, where some columns are stored together and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored
>   together.
> * Provide better control over bloat (similar to zheap).
> * Eliminate the need for separate toast tables.
> * Faster add / drop column, or changing the data type of a column, by
>   avoiding a full rewrite of the table.
>
> High-level Design - B-trees for the win!
> ========================================
>
> To start simple, let's ignore the column store aspect for a moment and
> consider it as a compressed row store. The column store is a natural
> extension of this concept, explained in the next section.
>
> The basic on-disk data structure is a B-tree, indexed by TID. The
> B-tree is a great data structure: fast and versatile. Note that this
> does not refer to the existing btree indexes, but to a new, separate
> B-tree used for table data storage.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. A logical TID
> provides the flexibility to move tuples to different pages when page
> splits or page merges are performed.
>
> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have a short uncompressed header,
> followed by btree items. Two kinds of items exist:
>
> - a plain item, holds one tuple or one datum, uncompressed payload
> - a "container item", holds multiple plain items, compressed payload
>
> +-----------------------------
> | Fixed-size page header:
> |
> | LSN
> | TID low and hi key (for Lehman & Yao B-tree operations)
> | left and right page pointers
> |
> | Items:
> |
> | TID | size | flags | uncompressed size | lastTID | payload (container item)
> | TID | size | flags | uncompressed size | lastTID | payload (container item)
> | TID | size | flags | undo pointer | payload (plain item)
> | TID | size | flags | undo pointer | payload (plain item)
> | ...
> |
> +----------------------------
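To make the leaf-item layout above a bit more concrete, here is a minimal C
sketch of what the two item kinds could look like. All of the struct and
field names (ZSItemHeader, ZSPlainItem, ZSContainerItem, undo_ptr, last_tid)
are illustrative guesses based on the description, not the definitions used
in the patch.

    /*
     * Hypothetical leaf-item headers mirroring the page diagram above.
     * Field names and widths are illustrative only.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t zstid;            /* only the low 48 bits are used */

    #define ZSTID_BITS  48
    #define ZSTID_MASK  ((UINT64_C(1) << ZSTID_BITS) - 1)

    /* Common prefix shared by both leaf item kinds. */
    typedef struct
    {
        zstid       tid;               /* logical row id (48 bits used) */
        uint16_t    size;              /* total item size in bytes */
        uint16_t    flags;             /* e.g. a "compressed" bit */
    } ZSItemHeader;

    /* Plain item: one uncompressed datum plus its undo record pointer. */
    typedef struct
    {
        ZSItemHeader hdr;
        uint64_t    undo_ptr;          /* points into the UNDO log */
        char        payload[];         /* uncompressed datum follows */
    } ZSPlainItem;

    /* Container item: many plain items, stored as one compressed payload. */
    typedef struct
    {
        ZSItemHeader hdr;
        uint32_t    uncompressed_size; /* payload size once uncompressed */
        zstid       last_tid;          /* highest TID covered by the container */
        char        payload[];         /* compressed stream of plain items */
    } ZSContainerItem;

    int
    main(void)
    {
        printf("max logical TID: %llu\n", (unsigned long long) ZSTID_MASK);
        printf("plain item header: %zu bytes, container item header: %zu bytes\n",
               sizeof(ZSPlainItem), sizeof(ZSContainerItem));
        return 0;
    }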
> Row store
> ---------
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, an undo record pointer, and the actual
> tuple data, uncompressed.
>
> In uncompressed form, the page can be arbitrarily large. But after
> compression, it must fit into a physical 8k block. If, on insert or
> update of a tuple, the page can no longer be compressed below 8k, the
> page is split. Note that because TIDs are logical rather than physical
> identifiers, we can freely move tuples from one physical page to
> another during a page split. A tuple's TID never changes.
>
> The buffer cache caches compressed blocks. Likewise, WAL-logging,
> full-page images etc. work on compressed blocks. Decompression is done
> on the fly, in backend-private memory, as and when needed while
> reading. For some encodings, like RLE or delta encoding, tuples can be
> constructed directly from the compressed data.
>
> Column store
> ------------
>
> A column store uses the same structure, but we have *multiple*
> B-trees, one for each column, all indexed by TID. The B-trees for all
> columns are stored in the same physical file.
>
> A metapage at block 0 has links to the roots of the B-trees. Leaf
> pages look the same, but instead of storing the whole tuple, they
> store just a single attribute. To reconstruct a row with a given TID,
> the scan descends the B-trees of all the columns using that TID, and
> fetches all the attributes. Likewise, a sequential scan walks all the
> B-trees in lockstep.
>
> So, in summary, one can imagine Zedstore as a forest of B-trees, one
> for each column, all indexed by TIDs.
>
> This way of laying out the data also easily allows for a hybrid
> row-column store, where some columns are stored together and others
> have a dedicated B-tree. We need user-facing syntax to allow
> specifying how to group the columns.
>
>
> Main reasons for storing data this way
> --------------------------------------
>
> * Lay out the data/tuples in a mapped fashion instead of keeping the
>   logical-to-physical mapping separate from the actual data. So, keep
>   the meta-data and data logically in a single stream of the file,
>   avoiding the need for separate forks/files to store meta-data and
>   data.
>
> * Stick to fixed-size physical blocks. Variable-size blocks require
>   more logical-to-physical mapping maintenance, plus restrictions on
>   the concurrency of writes and reads to files. Hence, adopt
>   compression to fit fixed-size blocks instead of the other way
>   around.
>
>
> MVCC
> ----
> MVCC works very similarly to zheap in zedstore. Undo record pointers
> are used to implement MVCC. Transaction information is not directly
> stored with the data. In zheap, there's a small, fixed number of
> "transaction slots" on each page, but zedstore has an undo pointer
> with each item directly; in normal cases, the compression squeezes
> this down to almost nothing.
>
>
> Implementation
> ==============
>
> Insert:
> Inserting a new row splits the row into datums. Then, for the first
> column, we decide which block to insert it into, pick a TID for it,
> and write an undo record. The rest of the columns are inserted using
> that same TID and point to the same undo position.
>
> Compression:
> Items are added to the B-tree in uncompressed form. If a page is full
> and a new item can't be added, compression kicks in. The existing
> uncompressed items (plain items) of the page are passed to the
> compressor for compression. Already compressed items are added back
> as is. The page is rewritten with the compressed data and the new
> item added to it. If the item still can't be added even after
> compression, a page split happens.
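Stepping back to the column-store layout for a moment: a row is
reconstructed by looking up the same TID once per column B-tree. The toy
sketch below uses sorted in-memory arrays as stand-ins for the on-disk
trees, purely to illustrate that lockstep lookup; none of the names come
from the patch.

    /*
     * Toy, in-memory illustration of row reconstruction in a column
     * store: one "tree" per column, all keyed by the same logical TID.
     * Real zedstore descends on-disk B-trees; sorted arrays stand in
     * for them here.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t zstid;

    typedef struct { zstid tid; int datum; } ColEntry;

    /* Binary-search one column's "tree" for the datum with the given TID. */
    static int
    column_fetch(const ColEntry *col, int nentries, zstid tid, int *datum)
    {
        int lo = 0, hi = nentries - 1;

        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;

            if (col[mid].tid == tid) { *datum = col[mid].datum; return 1; }
            if (col[mid].tid < tid)  lo = mid + 1;
            else                     hi = mid - 1;
        }
        return 0;                   /* no such TID in this column */
    }

    int
    main(void)
    {
        /* Two columns of the same table, both sorted by TID. */
        ColEntry col_a[] = {{1, 10}, {2, 20}, {5, 50}};
        ColEntry col_b[] = {{1, 11}, {2, 21}, {5, 51}};
        const zstid tid = 5;
        int a, b;

        /* Reconstruct the row: the same TID descends every column's tree. */
        if (column_fetch(col_a, 3, tid, &a) && column_fetch(col_b, 3, tid, &b))
            printf("tid %llu -> (%d, %d)\n", (unsigned long long) tid, a, b);
        return 0;
    }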
> Toast:
> When an overly large datum is stored, it is divided into chunks, and
> each chunk is stored on a dedicated toast page within the same
> physical file. The toast pages of a datum form a list; each page has
> a next/prev pointer.
>
> Select:
> A property is added to the table AM to convey whether column
> projection is leveraged by the AM for scans. While scanning tables
> whose AM has this property, the executor parses the plan, leveraging
> the target list and quals to find the columns required for the query.
> This list is passed down to the AM on beginscan. Zedstore uses this
> column projection list to only pull data from the selected columns. A
> virtual tuple table slot is used to pass back the datums for the
> subset of columns.
>
> The current table AM API requires enhancement here to pass the column
> projection down to the AM. The patch showcases two different ways of
> doing this.
>
> * For sequential scans, a new beginscan_with_column_projection() API
>   is added. The executor checks the AM property and, if the AM
>   leverages column projection, uses this new API, else the normal
>   beginscan() API.
>
> * For index scans, instead of modifying the beginscan API, a new API
>   is added to specifically pass the column projection list after
>   calling beginscan (to populate the scan descriptor) but before
>   fetching the tuples.
>
> Index Support:
> Building an index also leverages the columnar storage and only scans
> the columns required to build the index. Indexes work pretty much the
> same as for heap tables: data is inserted into the table, and the TID
> for the tuple gets stored in the index. On index scans, the required
> column B-trees are scanned for the given TID and the datums are
> passed back using a virtual tuple.
>
> Page Format:
> A ZedStore table contains different kinds of pages, all in the same
> file: the meta-page, per-attribute btree internal and leaf pages, UNDO
> log pages, and toast pages. Each page type has its own distinct data
> storage format.
>
> Block 0 is always the metapage. It contains the block numbers of the
> other data structures stored within the file, like the per-attribute
> B-trees and the UNDO log.
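One way to picture that metapage is as a small directory: one root block
number per attribute plus the head of the UNDO log. The sketch below is
only a guess at such a layout (ZSMetaPage, attr_root, undo_head and the
fixed attribute cap are made-up names and simplifications; the real patch
may store more per attribute). It also shows why adding a column is cheap:
only a new, initially empty tree has to be registered, which ties in with
the "Enhancements to design" notes below.

    /*
     * Hypothetical sketch of a zedstore metapage at block 0: a directory
     * of per-attribute B-tree roots plus the UNDO log head.  All names
     * and fields are illustrative, not taken from the patch.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
    #define MAX_ATTRS 32                   /* fixed cap just for this sketch */

    typedef struct
    {
        uint32_t    nattrs;                /* number of attribute B-trees */
        BlockNumber attr_root[MAX_ATTRS];  /* root page of each attribute's tree */
        BlockNumber undo_head;             /* first page of the UNDO log */
    } ZSMetaPage;

    /* "ALTER TABLE ... ADD COLUMN": register a new, initially empty tree. */
    static void
    metapage_add_attribute(ZSMetaPage *meta)
    {
        meta->attr_root[meta->nattrs++] = InvalidBlockNumber;  /* no root yet */
    }

    int
    main(void)
    {
        ZSMetaPage  meta = {0};

        meta.undo_head = 1;                /* pretend the UNDO log starts at block 1 */
        metapage_add_attribute(&meta);     /* column 1 */
        metapage_add_attribute(&meta);     /* column 2 */

        printf("attributes registered: %u, undo log head: block %u\n",
               (unsigned) meta.nattrs, (unsigned) meta.undo_head);
        return 0;
    }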
> Enhancements to design:
> =======================
>
> Instead of compressing all the tuples on a page in one batch, we could
> store a small "dictionary", e.g. in the page header or meta-page, and
> use it to compress each tuple separately. That could make random reads
> and updates of individual tuples faster.
>
> When adding a column, we only need to create a new B-tree for the
> newly added column and link it to the meta-page. No existing content
> needs to be rewritten.
>
> When a column is dropped, we can scan the B-tree of that column and
> immediately mark all its pages as free in the FSM. But we don't
> actually have to scan the leaf level: all leaf tuples have a downlink
> in the parent, so we can scan just the internal pages. Unless the
> column is very wide, that's only a small fraction of the data. That
> makes the space immediately reusable for new insertions, but it won't
> return the space to the operating system. In order to do that, we'd
> still need to defragment, moving pages from the end of the file closer
> to the beginning, and truncate the file.
>
> In this design, we only cache compressed pages in the page cache. If
> we want to cache uncompressed pages instead, or in addition to that,
> we need to invent a whole new kind of buffer cache that can deal with
> variable-size blocks.
>
> If you do a lot of updates, the file can get fragmented, with lots of
> unused space on pages. Losing the correlation between TIDs and
> physical order is also bad, because it will make SeqScans slow, as
> they're not actually doing sequential I/O anymore. We can write a
> defragmenter to fix things up. Half-empty pages can be merged, and
> pages can be moved to restore the TID/physical correlation. This
> format doesn't have the same MVCC problems with moving tuples around
> that the Postgres heap does, so it can fairly easily be done on-line.
>
> Min-Max values can be stored per block to easily skip scanning a block
> if the column values don't fall in the requested range.
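That per-block Min-Max idea is essentially a zone map. Below is a minimal
sketch of the skip test, assuming each block carries a min/max pair for the
scanned column; the struct and function names are invented for
illustration.

    /*
     * Toy zone-map check: skip a block entirely if its per-block min/max
     * for the scanned column cannot satisfy "val BETWEEN lo AND hi".
     * Names and layout are invented for illustration only.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct
    {
        int         min;    /* smallest column value stored in the block */
        int         max;    /* largest column value stored in the block */
    } BlockMinMax;

    static bool
    block_may_match(const BlockMinMax *mm, int lo, int hi)
    {
        /* The block can be skipped when its range misses [lo, hi] entirely. */
        return mm->max >= lo && mm->min <= hi;
    }

    int
    main(void)
    {
        BlockMinMax blocks[] = {{1, 100}, {101, 200}, {201, 300}};
        int         lo = 150, hi = 180;

        for (int i = 0; i < 3; i++)
            printf("block %d: %s\n", i,
                   block_may_match(&blocks[i], lo, hi) ? "scan" : "skip");
        return 0;
    }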
> Notes about current patch
> =========================
>
> Basic (core) functionality is implemented to showcase and play with.
>
> Two compression algorithms are supported: Postgres pg_lzcompress and
> lz4. Compiling the server with --with-lz4 enables LZ4 compression for
> zedstore, else pg_lzcompress is the default. LZ4 is definitely super
> fast at compressing and uncompressing.
>
> Not all the table AM APIs are implemented. Functionality not
> implemented yet will ERROR out as not supported. A Zedstore table can
> be created using the command:
>
> CREATE TABLE <name> (column listing) USING zedstore;
>
> Bulk load can be performed using COPY. INSERT, SELECT, UPDATE and
> DELETE work. Btree indexes can be created. Btree and bitmap index
> scans work. The test in src/test/regress/sql/zedstore.sql showcases
> all the functionality working currently. Updates are currently
> implemented as cold, meaning they always create new items and are not
> performed in-place.
>
> TIDs currently can't leverage the full 48-bit range, but instead need
> to be limited to values which are considered valid ItemPointers. Also,
> MaxHeapTuplesPerPage poses restrictions on the values they can
> currently have. Refer to [7] for the same.
>
> Extremely basic UNDO logging has been implemented, just from the MVCC
> perspective. MVCC is missing tuple locks right now. Plus, it doesn't
> actually perform any undo yet. No WAL logging exists currently, hence
> it's not crash safe either.
>
> Helpful functions are provided to find how many pages of each type are
> present in a zedstore table and also to find the compression ratio.
>
> The test mentioned in the thread "Column lookup in a row performance"
> [6], a good example query for zedstore, run locally on a laptop using
> lz4, shows:
>
> postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x;  -- heap
>          avg
> ---------------------
>  500000.500000000000
> (1 row)
>
> Time: 4679.026 ms (00:04.679)
>
> postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x;  -- zedstore
>          avg
> ---------------------
>  500000.500000000000
> (1 row)
>
> Time: 379.710 ms
>
> Important note:
> ---------------
> The planner has not been modified yet to leverage the columnar
> storage. Hence, plans using the "physical tlist" optimization or
> similar, which are good for a row store, currently miss out on
> leveraging the columnar nature. Hence the need for the subquery with
> OFFSET 0 above, to disable that optimization and scan only the
> required column.
>
> The current proposal and discussion is focused on the AM layer work
> first. Hence, it currently and intentionally skips discussing planner
> or executor "feature" enhancements like adding vectorized execution
> and that family of features.
>
> Previous discussions and implementations of column stores - Vertical
> cluster index [2], In-core columnar storage [3] and [4], and
> cstore_fdw [5] - were referred to, to distill the objectives and come
> up with a design and implementation that avoids earlier concerns
> raised. Learnings from the Greenplum Database column store were also
> leveraged while designing and implementing this.
>
> Credit: The design is mostly the brainchild of Heikki, or actually his
> epiphany to be exact. I acted as an idea bouncing board and
> contributed enhancements to it. We are both having a lot of fun
> writing the code for this.
>
>
> References
> [1] https://github.com/greenplum-db/postgres/tree/zedstore
> [2] https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com
> [3] https://www.postgresql.org/message-id/flat/20150611230316.GM133018%40postgresql.org
> [4] https://www.postgresql.org/message-id/flat/20150831225328.GM2912%40alvherre.pgsql
> [5] https://github.com/citusdata/cstore_fdw
> [6] https://www.postgresql.org/message-id/flat/CAOykqKfko-n5YiBJtk-ocVdp%2Bj92Apu5MJBwbGGh4awRY5NCuQ%40mail.gmail.com
> [7] https://www.postgresql.org/message-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43%40iki.fi

Reading about it reminds me of this work -- TAG column storage
(http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf).
Isn't this storage system inspired by that, with the TID as the TAG? It is
not referenced here, so it made me wonder.

--
Regards,
Rafia Sabih