Alex Wang and I have been doing some performance analysis of the most recent version of the zedstore branch, and have some interesting statistics to share.
We specifically focused on TPC-DS query 2, because it plays to what should be
the strength of zedstore, namely a full table scan of only a subset of
columns. I've attached the explain verbose output for reference. We scan two
columns of 'catalog_sales', and two columns of 'web_sales':

  ->  Parallel Append
        ->  Parallel Seq Scan on tpcds.catalog_sales
              Output: catalog_sales.cs_ext_sales_price, catalog_sales.cs_sold_date_sk
        ->  Parallel Seq Scan on tpcds.web_sales
              Output: web_sales.ws_ext_sales_price, web_sales.ws_sold_date_sk

With heap, we need to do a full table scan of both tables and read the entire
tables into memory. For our dataset, that totals around 119GB of data.

***HEAP***
tpcds=# select pg_size_pretty(pg_relation_size('web_sales'));
 pg_size_pretty
----------------
 39 GB
(1 row)

tpcds=# select pg_size_pretty(pg_relation_size('catalog_sales'));
 pg_size_pretty
----------------
 80 GB
(1 row)
***/HEAP***

With zedstore, the total relation size is smaller because of compression, and
when scanning we only read the blocks holding the columns we are interested
in and leave the rest alone. The total size we need to scan for these tables
is therefore only around 4GB.

***ZEDSTORE***
zedstore=# select pg_size_pretty(pg_relation_size('web_sales'));
 pg_size_pretty
----------------
 20 GB
(1 row)

zedstore=# select pg_size_pretty(pg_relation_size('catalog_sales'));
 pg_size_pretty
----------------
 40 GB
(1 row)

zedstore=# with zedstore_tables as (select d.oid, f.*
zedstore(#   from (select c.oid
zedstore(#           from pg_am am
zedstore(#           join pg_class c on (c.relam = am.oid)
zedstore(#          where am.amname = 'zedstore') d,
zedstore(#        pg_zs_btree_pages(d.oid) f)
zedstore-# select zs.attno, att.attname, zs.oid::regclass, count(zs.attno) as pages,
zedstore-#        pg_size_pretty(count(zs.attno) * 8 * 1024) from zedstore_tables zs
zedstore-#   left join pg_attribute att on zs.attno = att.attnum
zedstore-#        and zs.oid = att.attrelid
zedstore-#  where zs.oid in ('catalog_sales'::regclass, 'web_sales'::regclass)
zedstore-#    and (att.attname in ('cs_ext_sales_price','cs_sold_date_sk','ws_ext_sales_price','ws_sold_date_sk')
zedstore(#         or zs.attno = 0)
zedstore-#  group by zs.attno, att.attname, zs.oid
zedstore-#  order by zs.oid, zs.attno;
 attno |      attname       |      oid      | pages  | pg_size_pretty
-------+--------------------+---------------+--------+----------------
     0 |                    | catalog_sales |  39549 | 309 MB
     1 | cs_sold_date_sk    | catalog_sales |   2441 | 19 MB
    24 | cs_ext_sales_price | catalog_sales | 289158 | 2259 MB
     0 |                    | web_sales     |  20013 | 156 MB
     1 | ws_sold_date_sk    | web_sales     |  17578 | 137 MB
    24 | ws_ext_sales_price | web_sales     | 144860 | 1132 MB
***/ZEDSTORE***

On our test machine the tables were stored on a single spinning disk, so read
speed with this query was pretty abysmal. The query is I/O bound for us, so
disk throughput was the single largest factor in its runtime.
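As an aside, pg_zs_btree_pages() can also give a rough picture of how the
pages of the attributes we scan are laid out on disk. The following is only a
hypothetical sketch: it assumes the function exposes the physical block
number of each btree page (called 'blkno' here), which may not be the actual
column name in the branch.

-- hypothetical sketch: list which attribute owns each physical block, in
-- on-disk order, to see how interleaved the per-attribute btrees are
select blkno, attno
  from pg_zs_btree_pages('catalog_sales'::regclass)
 where attno in (0, 1, 24)   -- attno 0 plus cs_sold_date_sk, cs_ext_sales_price
 order by blkno
 limit 40;

If consecutive block numbers mostly belong to different attributes, a scan
that only needs a couple of columns ends up hopping around the file instead
of reading it front to back, which matches the iostat numbers below.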
With heap, the tables are scanned sequentially, so we can read around 150MB
of table data per second:

***HEAP***
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.54    0.00    1.85   11.62    0.00   77.98

Device     r/s     w/s      rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdd    1685.33    0.00  157069.33    0.00   18.67    0.00   1.10   0.00    1.56    0.00   2.62    93.20     0.00   0.59 100.00

Device     r/s     w/s      rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdd    1655.33    0.00  154910.67    0.00   21.33    0.00   1.27   0.00    1.62    0.00   2.68    93.58     0.00   0.60 100.13

Device     r/s     w/s      rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdd    1746.33    0.00  155121.33    0.00   28.00    0.00   1.58   0.00    1.48    0.00   2.61    88.83     0.00   0.57 100.00
***/HEAP***

Because zedstore's access pattern resembles random I/O, read speed on our
single disk suffered badly: we saw roughly 150x lower throughput, around
1MB/s.

***ZEDSTORE***
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.24    0.00    1.22    6.34    0.00   86.20

Device     r/s     w/s      rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb     129.33    0.00    1034.67    0.00    0.00    0.00   0.00   0.00   15.89    0.00   2.05     8.00     0.00   7.67  99.20

Device     r/s     w/s      rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb     120.67    0.00     965.33    0.00    0.00    0.00   0.00   0.00   16.51    0.00   1.99     8.00     0.00   8.21  99.07

Device     r/s     w/s      rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb     121.00    0.00     968.00    0.00    0.00    0.00   0.00   0.00   16.76    0.00   2.02     8.00     0.00   8.19  99.07
***/ZEDSTORE***

The total query time:

***HEAP***
Execution Time: 758807.571 ms
***/HEAP***

***ZEDSTORE***
Execution Time: 2111576.259 ms
***/ZEDSTORE***

Every attribute in zedstore is stored in its own btree, keyed by TID. Unlike
heap, the TID is a logical address, not a physical one, and the pages of one
attribute are interspersed with the pages of all other attributes. When you
do a sequential scan on zedstore, the pages you need are therefore not laid
out sequentially on disk, so the access pattern can resemble random I/O.

On our system, zedstore was around 3x slower than heap for this query. If
your storage does not handle semi-random reads very well, zedstore can be
very slow. Our setup was close to a worst case, because random reads were
150x slower than sequential reads; on hardware with better random I/O,
zedstore would really shine.

On a side note, a second run of this query on zedstore finished in around 57
seconds, because the ~4GB of column data was already cached in memory (shared
buffers and the OS page cache). The cached footprint is smaller because we
only keep the relevant columns in memory, and the datums are compressed and
encoded. Conversely, running the same query again on heap still takes around
750 seconds, because our system cannot hold 119GB of relation data in those
caches. (A generic way to verify where a repeated run gets its pages from is
sketched at the end of this mail.)

Our main takeaway is that anything we can do to group together data that is
accessed together will help zedstore do larger, more frequent sequential
reads.

On Mon, Oct 28, 2019 at 3:22 PM Taylor Vesely <tves...@pivotal.io> wrote:
> > When a zedstore table is queried using *invalid* ctid, the server
> > crashes due to assertion failure. See below,
> >
> > postgres=# select * from t2 where ctid = '(0, 0)';
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost.
> > Attempting reset: Failed.
>
> Thank you for pointing that out! I will look into fixing that some
> time this week. If we run without assertions the query still fails
> with this error because zedstoream_tuple_tid_valid incorrectly reports
> the TID as valid:
>
> ERROR: arrived at incorrect block 2 while descending zedstore btree
>
> > I believe above should have either returned 1 rows or failed with some
> > user friendly error.
>
> Agreed. I think it should match the behavior of heap as closely as
> possible.
>
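Coming back to the warm-cache observation above: a simple, generic way to see
where a repeated run is getting its pages from is EXPLAIN (ANALYZE, BUFFERS),
optionally with track_io_timing enabled. Nothing below is zedstore-specific,
and it is only a sketch using one of the columns from the query above; note
that pages served from the OS page cache still show up as "read" in the
buffer counters, just with a much smaller I/O time.

-- may require superuser; makes EXPLAIN report "I/O Timings"
set track_io_timing = on;

-- run twice in the same session: the cold run should show large "shared read"
-- counts and read time, the warm run mostly "shared hit" and/or cheap reads
explain (analyze, buffers, timing off)
select sum(ws_ext_sales_price) from web_sales;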
tpcds.02.explain_verbose.zedstore.out
Description: Binary data
tpcds.02.explain_verbose.heap.out
Description: Binary data