Column store in Greenplum

Asim R P Thu, 07 Jun 2018 14:35:37 -0700

Hi,

In the pluggable storage unconference session in Ottawa,
column oriented storage was a key point of discussion.  We would like
to share an overview of Greenplum's column store.  We hope that this
will aid a discussion to carve out a pluggable storage interface.
This is just an overview, source code is where the truth is:
https://github.com/greenplum-db/gpdb


Greenplum introduced "appendonly" as a table type to support row as
well as column orientation.  Design goal for appendonly tables is to
reduce I/O cost so that workloads involving bulk I/O (OLAP) run
faster.  As the name suggests, tuples stored in the underlying files
are immutable (no in-place updates / deletes).  The files can only be
extended with new data.  We will focus only on column oriented
appendonly tables in what follows.

A create table SQL command to create column oriented table:

    CREATE TABLE col_table (c1 int, c2 varchar) WITH (appendonly=true,
orientation=column);

Storage format:

Each column of a column oriented table gets a dedicated file in the
data directory.  A file contains a series of variable length blocks
("varblocks").  Each concurrent transaction writing to an appendonly
table is allocated a dedicated set of files, one per column.  At the
most, 128 concurrent transactions are allowed to write to the same
appendonly table, leading to at the most 128 groups of files (one per
column).  Let's assume two concurrent transactions inserted into the
"col_table" in the above example.  The table will have two groups of
files, each group containing two files (one per column).  A group of
files is called "segment".  Segment 1 for "col_table" contains files
"<relfilenode>.1" (c1) and "<relfilenode>.129" (c2).  Segment 2
contains files "<relfilenode>.2" (c1) and "<relfilenode>.130" (c2).

Appendonly tuples do not contain MVCC information (xmin/xmax).  An
auxiliary table, pg_aoseg.pg_aoseg_<oid>, is created for each
appendonly table.  It keeps track of the length of each segment file
for the appendonly table.  The aoseg table is a heap table, normal
MVCC rules apply.  This setup isolates transactions concurrently
reading or writing to an appendonly table from each other.

Similar to heap TID, appendonly TID is a 6-byte struct
(appendonlytid.h).  It comprises of segment number and row number.
Appendonly TID is recorded in indexes without much distinction from
heap TID.  Unlike heap TID, an appendonly TID is not sufficient to
locate a row number within a column's file.  Another auxiliary table,
block directory (pg_aoseg.pg_aoblckdir_<oid>), maps a row number to
offsets in the segment.  The offset points to start of the varblock
containing the row number.  Index lookups first obtain the appendonly
TID from the index, then map the row number to its offset using block
directory and then read the varblock from the file.  The aoblckdir
table is heap, normal MVCC rules apply.

UPDATE and DELETE:

Updates and deletes are supported by maintaining a bitmap representing
visibility of appendonly row numbers.  Note that visibility is
attribute of a row.  E.g. if row number 10 is invisible in
"col_table", the column values corresponding to row number 10 in both
"<relfilenode>.1" as well as "<relfilenode>.129" files should be
considered invisible.  The bitmap is maintained in another auxiliary
table - pg_aoseg.aovisimap, which is also a normal heap table.
Deleting a row results in no change to the appendonly files.  The bit
representing the deleted row number is set in the aovisimap table.
UPDATE is delete followed by insert.  No update chains are maintained.

Catalog changes:

* pg_appenonly table contains OIDs of auxiliary tables for each
appendonly table.
* For each appendonly table, the auxiliary tables created are: aoseg,
aovisimap and block directory.  Similar to toast tables, these have
their own entires in pg_class with their own relfilenodes.
* New pg_class.relstorage values 'a' for row oriented AO and 'c' for
column oriented.

Compression:

Each column can be compressed.  Supported compression algorithms are
zlib, zstd, RLE (run length encoding), delta compression on top of
RLE.

Buffer management:

Backend private memory is used to buffer contents of appendonly files.
Block size (a multiple of 8K) can be specified per column.  The amount
of memory buffered per file is equal to the block size configured for
that column.

Executor touch points:

Column store offers access methods such as
aocs_beginscan(), aocs_getnext(), aocs_fetch(), aocs_rescan(),
aocs_insert().  One may come across a bunch of conditionals sprinkled
around executor code to invoke either heap, appendonly row or column
store access methods.  A better abstraction layer would have made the
code much cleaner but we don't have it today.

Scan of column oriented tables is performed by reading the projected
columns' files in a loop.  Note that the values stored in the files
contain only data, no MVCC information is stored.  Scan node returns a
tuple table slot, just like in case of heap tables.  The slot contains
virtual tuples.  The values and nulls arrays contain the values read
from respective files.  If needed during execution, the virtual tuples
are converted to what we call "memtuples".  Memtuples are similar to
minimal tuples but supposedly more optimized for accessing individual
datums.


Vacuum:

Vacuum operates on one appendonly segment at a time.  The appendonly
segment is locked such that concurrent insert/update/delete are
blocked from operating on this segment.  The segment is scanned using
aocs_getnext() and aovisimap is used to filter out invisible row
numbers.  Only the visible row numbers are written to a new segment.
After the scan is finished, the old segment is marked for deletion in
pg_aocsseg_<oid> table.  Transactions reading the table do not read
files belonging to segments that are marked for deletion.  After all
segments have been scanned, all auxiliary tables (aocsseg, aovisimap,
block directory and toast) are vacuumed.

Problem worth highlighting:

The design of aovisimap table poses hurdles for concurrent update and
delete operations on the same appendonly table.  One aovismap entry
covers multiple row numbers in the appendonly table.  If two
concurrent transactions delete or update different row numbers such
that they are covered by the same aovisimap tuple, one of them must
abort.

Not having chained updated versions of the same tuple. the
aocs_udpate() interface differs from the heap counterpart, leading to
anomalies such as
https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/73LOuQFkIps/1TNHW0lGAwAJ

Unique indexes are not supported on column oriented tables.
Identifying a concurrent transaction that might be inserting duplicate
tuples becomes hard because (1) lack of MVCC information in tuples and
(2) one block directory tuple covers multiple row numbers.

Asim (on behalf of Greenplum team)

Column store in Greenplum

Reply via email to