On 23/05/19 12:07 PM, Ashwin Agrawal wrote:

We (Heikki, Melanie, and I) are continuing to build Zedstore, and wish
to share the recent additions and modifications. Attaching a patch
with the latest code; link to the github branch [1] to follow
along. The approach we have been leaning towards is to build the
required functionality, get the tests passing, and then keep iterating
to optimize it. It's still work-in-progress.

Sharing the details now, as we have reached our next milestone for
Zedstore. All table AM APIs are implemented for Zedstore (except
compute_xid_horizon_for_tuples, which seems to need a test first).

Current State:

- A new type of item, the "array item", has been added to Zedstore to
  boost compression and performance, based on Konstantin's performance
  experiments [2] and input from Tomas Vondra [3]. An array item holds
  multiple datums with consecutive TIDs and the same visibility
  information, and saves space compared to multiple single items by
  leaving out the repetitive UNDO and TID fields. An array item cannot
  mix NULLs and non-NULLs. So, those experiments should show improved
  performance now. Inserting data via COPY currently creates array
  items; the insert code path has not been modified since last time,
  and making singleton inserts and INSERT ... SELECT performant is
  still on the todo list.

- We now have a separate, dedicated meta-column btree alongside the
  rest of the data-column btrees. This special, first btree is used to
  assign TIDs to tuples and to track the UNDO location that provides
  visibility information. Because it always exists, it also lets us
  support zero-column tables (which can result from ADD COLUMN / DROP
  COLUMN sequences). Storing meta-data separately from the data also
  yields better compression ratios and further simplifies the overall
  design/implementation: deletes only need to edit the meta-column and
  never touch the actual data btrees, index scans can perform
  visibility checks against the meta-column and fetch datums only for
  visible tuples, and tuple locks likewise only need to access the
  meta-column. Previously, every column btree carried the same undo
  pointer, so with the old layout a visibility check could potentially
  be performed using any column; considering the overall
  simplification the new layout provides, we think it's fine to give
  up that property. The dedicated meta-column also greatly simplified
  handling ADD COLUMN with default and null values, since this column
  deterministically provides all the TIDs present in the table, which
  can't be said of any data column due to default or null values
  during ADD COLUMN.

- Free Page Map implemented. The Free Page Map keeps track of unused
  pages in the relation. The FPM is also a b-tree, indexed by physical
  block number. To be more compact, it stores "extents", i.e. block
  ranges, rather than individual blocks, when possible. An interesting
  paper [4] on how modern filesystems manage space was a good source
  of ideas.

- Tuple locks implemented

- Serializable isolation handled

- With "default_table_access_method=zedstore":
  - 31 out of 194 regress tests failing
  - 10 out of 86 isolation tests failing
Many of the currently failing tests are due to plan differences, like
Index scans being selected for zedstore instead of Index Only scans,
as zedstore doesn't yet have a visibility map; I have yet to give
index-only scans some thought. Other plan diffs come from table size
differences between heap and zedstore.

Next few milestones we wish to hit for Zedstore:
- Make the regress checks green
- Make the isolation checks green
- Make Zedstore crash safe (which also means replication safe):
  implement WAL logging
- Performance profiling and optimizations for inserts, selects, index
  scans, etc...
- Once the UNDO framework lands upstream, make Zedstore leverage it
  instead of its own version of UNDO

Open questions / discussion items:

- How best to get the "column projection list" from the planner?
  (Currently, we walk the plan in the executor and find the columns
  required for the query; see GetNeededColumnsForNode().)

- How to pass the "column projection list" to the table AM? (As
  stated in the initial email, we have currently modified the table AM
  API to pass the projection to the AM.)

- TID is treated as (block, offset) in the current indexing code.

- Physical tlist optimization? (Currently, we have disabled it for
  zedstore.)

Team:
Melanie joined Heikki and me in writing code for zedstore. The
majority of the code continues to be contributed by Heikki. We are
continuing to have fun building a column store implementation and
iterating aggressively.

References:
1] https://github.com/greenplum-db/postgres/tree/zedstore
2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf


FWIW - building this against latest 12 beta1:

Loading and examining the standard pgbench schema (with the old names,
sorry) in v10 (standard heap) and v12 (zedstore):

v10:

bench=# \i load.sql
COPY 100
Time: 16.335 ms
COPY 1000
Time: 16.748 ms
COPY 10000000
Time: 50276.230 ms (00:50.276)
bench=# \dt+
                        List of relations
 Schema |   Name   | Type  |  Owner   |    Size    | Description
--------+----------+-------+----------+------------+-------------
 public | accounts | table | postgres | 1281 MB    |
 public | branches | table | postgres | 8192 bytes |
 public | history  | table | postgres | 0 bytes    |
 public | tellers  | table | postgres | 72 kB      |

v12+zedstore:

bench=# \i load.sql
COPY 100
Time: 0.656 ms
COPY 1000
Time: 3.573 ms
COPY 10000000
Time: 26244.832 ms (00:26.245)
bench=# \dt+
                      List of relations
 Schema |   Name   | Type  |  Owner   |  Size   | Description
--------+----------+-------+----------+---------+-------------
 public | accounts | table | postgres | 264 MB  |
 public | branches | table | postgres | 56 kB   |
 public | history  | table | postgres | 0 bytes |
 public | tellers  | table | postgres | 64 kB   |

So a good improvement in load times and on-disk footprint! Also note
that I did not build with lz4, so it looks like you guys have fixed
the quirks with compression making things bigger.

regards

Mark
