Hi, In the pluggable storage unconference session in Ottawa, column oriented storage was a key point of discussion. We would like to share an overview of Greenplum's column store. We hope that this will aid a discussion to carve out a pluggable storage interface. This is just an overview, source code is where the truth is: https://github.com/greenplum-db/gpdb
Greenplum introduced "appendonly" as a table type to support row as well as column orientation. Design goal for appendonly tables is to reduce I/O cost so that workloads involving bulk I/O (OLAP) run faster. As the name suggests, tuples stored in the underlying files are immutable (no in-place updates / deletes). The files can only be extended with new data. We will focus only on column oriented appendonly tables in what follows. A create table SQL command to create column oriented table: CREATE TABLE col_table (c1 int, c2 varchar) WITH (appendonly=true, orientation=column); Storage format: Each column of a column oriented table gets a dedicated file in the data directory. A file contains a series of variable length blocks ("varblocks"). Each concurrent transaction writing to an appendonly table is allocated a dedicated set of files, one per column. At the most, 128 concurrent transactions are allowed to write to the same appendonly table, leading to at the most 128 groups of files (one per column). Let's assume two concurrent transactions inserted into the "col_table" in the above example. The table will have two groups of files, each group containing two files (one per column). A group of files is called "segment". Segment 1 for "col_table" contains files "<relfilenode>.1" (c1) and "<relfilenode>.129" (c2). Segment 2 contains files "<relfilenode>.2" (c1) and "<relfilenode>.130" (c2). Appendonly tuples do not contain MVCC information (xmin/xmax). An auxiliary table, pg_aoseg.pg_aoseg_<oid>, is created for each appendonly table. It keeps track of the length of each segment file for the appendonly table. The aoseg table is a heap table, normal MVCC rules apply. This setup isolates transactions concurrently reading or writing to an appendonly table from each other. Similar to heap TID, appendonly TID is a 6-byte struct (appendonlytid.h). It comprises of segment number and row number. Appendonly TID is recorded in indexes without much distinction from heap TID. Unlike heap TID, an appendonly TID is not sufficient to locate a row number within a column's file. Another auxiliary table, block directory (pg_aoseg.pg_aoblckdir_<oid>), maps a row number to offsets in the segment. The offset points to start of the varblock containing the row number. Index lookups first obtain the appendonly TID from the index, then map the row number to its offset using block directory and then read the varblock from the file. The aoblckdir table is heap, normal MVCC rules apply. UPDATE and DELETE: Updates and deletes are supported by maintaining a bitmap representing visibility of appendonly row numbers. Note that visibility is attribute of a row. E.g. if row number 10 is invisible in "col_table", the column values corresponding to row number 10 in both "<relfilenode>.1" as well as "<relfilenode>.129" files should be considered invisible. The bitmap is maintained in another auxiliary table - pg_aoseg.aovisimap, which is also a normal heap table. Deleting a row results in no change to the appendonly files. The bit representing the deleted row number is set in the aovisimap table. UPDATE is delete followed by insert. No update chains are maintained. Catalog changes: * pg_appenonly table contains OIDs of auxiliary tables for each appendonly table. * For each appendonly table, the auxiliary tables created are: aoseg, aovisimap and block directory. Similar to toast tables, these have their own entires in pg_class with their own relfilenodes. * New pg_class.relstorage values 'a' for row oriented AO and 'c' for column oriented. Compression: Each column can be compressed. Supported compression algorithms are zlib, zstd, RLE (run length encoding), delta compression on top of RLE. Buffer management: Backend private memory is used to buffer contents of appendonly files. Block size (a multiple of 8K) can be specified per column. The amount of memory buffered per file is equal to the block size configured for that column. Executor touch points: Column store offers access methods such as aocs_beginscan(), aocs_getnext(), aocs_fetch(), aocs_rescan(), aocs_insert(). One may come across a bunch of conditionals sprinkled around executor code to invoke either heap, appendonly row or column store access methods. A better abstraction layer would have made the code much cleaner but we don't have it today. Scan of column oriented tables is performed by reading the projected columns' files in a loop. Note that the values stored in the files contain only data, no MVCC information is stored. Scan node returns a tuple table slot, just like in case of heap tables. The slot contains virtual tuples. The values and nulls arrays contain the values read from respective files. If needed during execution, the virtual tuples are converted to what we call "memtuples". Memtuples are similar to minimal tuples but supposedly more optimized for accessing individual datums. Vacuum: Vacuum operates on one appendonly segment at a time. The appendonly segment is locked such that concurrent insert/update/delete are blocked from operating on this segment. The segment is scanned using aocs_getnext() and aovisimap is used to filter out invisible row numbers. Only the visible row numbers are written to a new segment. After the scan is finished, the old segment is marked for deletion in pg_aocsseg_<oid> table. Transactions reading the table do not read files belonging to segments that are marked for deletion. After all segments have been scanned, all auxiliary tables (aocsseg, aovisimap, block directory and toast) are vacuumed. Problem worth highlighting: The design of aovisimap table poses hurdles for concurrent update and delete operations on the same appendonly table. One aovismap entry covers multiple row numbers in the appendonly table. If two concurrent transactions delete or update different row numbers such that they are covered by the same aovisimap tuple, one of them must abort. Not having chained updated versions of the same tuple. the aocs_udpate() interface differs from the heap counterpart, leading to anomalies such as https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/73LOuQFkIps/1TNHW0lGAwAJ Unique indexes are not supported on column oriented tables. Identifying a concurrent transaction that might be inserting duplicate tuples becomes hard because (1) lack of MVCC information in tuples and (2) one block directory tuple covers multiple row numbers. Asim (on behalf of Greenplum team)