On Thu, Aug 29, 2019 at 5:39 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
> On 29/08/2019 14:30, Ashutosh Sharma wrote:
> >
> > On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang <lew...@pivotal.io
> > <mailto:lew...@pivotal.io>> wrote:
> >
> >     You are correct that we currently go through each item in the leaf
> >     page that
> >     contains the given tid, specifically, the logic to retrieve all the
> >     attribute
> >     items inside a ZSAttStream is now moved to decode_attstream() in the
> >     latest
> >     code, and then in zsbt_attr_fetch() we again loop through each item we
> >     previously retrieved from decode_attstream() and look for the given
> >     tid.
> >
> >
> > Okay. Any idea why this new way of storing attribute data as streams
> > (lowerstream and upperstream) has been chosen just for the attributes
> > but not for tids? Are only attribute blocks compressed but not the
> > tid blocks?
>
> Right, only attribute blocks are currently compressed. Tid blocks need
> to be modified when there are UPDATEs or DELETEs, so I think having to
> decompress and recompress them would be more costly. Also, there is no
> user data on the TID tree, and the Simple-8b encoded codewords used to
> represent the TIDs are already pretty compact. I'm not sure how much
> gain you would get from passing it through a general purpose compressor.
>
> I could be wrong though. We could certainly try it out, and see how it
> performs.
>
> >     One
> >     optimization we can do is to tell decode_attstream() to stop
> >     decoding at the
> >     tid we are interested in. We can also apply other tricks to speed up the
> >     lookups in the page, for fixed length attribute, it is easy to do
> >     binary search
> >     instead of linear search, and for variable length attribute, we can
> >     probably
> >     try something that we didn't think of yet.
> >
> >
> > I think we can probably ask decode_attstream() to stop once it has found
> > the tid that we are searching for, but then we only need to do that for
> > Index Scans.
>
> I've been thinking that we should add a few "bookmarks" on long streams,
> so that you could skip e.g. to the midpoint in a stream. It's a tradeoff
> though; when you add more information for random access, it makes the
> representation less compact.
>
> >     Zedstore currently implement update as delete+insert, hence the old
> >     tid is not
> >     reused. We don't store the tuple in our UNDO log, and we only store the
> >     transaction information in the UNDO log. Reusing the tid of the old
> >     tuple means
> >     putting the old tuple in the UNDO log, which we have not implemented
> >     yet.
> >
> > Okay, so that means performing an update on a non-key attribute would
> > also require changes in the index table. In short, HOT update is
> > currently not possible with a zedstore table. Am I right?
>
> That's right. There's a lot of potential gain for doing HOT updates. For
> example, if you UPDATE one column on every row on a table, ideally you
> would only modify the attribute tree containing that column. But that
> hasn't been implemented.

Thanks Heikki for your reply. After quite some time, I got a chance
today to look back into the code. I see that you have changed the
tuple insertion and update mechanism a bit. As per the latest changes,
all the tuples inserted/updated in a transaction are spooled into a
hash table and then flushed at transaction commit. Probably due to
this change, the server crashes when performing an UPDATE operation on
a zedstore table containing 1 million records. See the example below:

postgres=# create table t1(a int, b int) using zedstore;
postgres=# insert into t1 select i, i+10 from generate_series(1, 1000000) i;
postgres=# update t1 set b = 200;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The above update statement appears to have crashed due to an extensive
memory leak.

Further, the UPDATE operation on a zedstore table is very slow. I
think that's because in case of a zedstore table we have to update all
the btree data structures even if just one column is updated, which is
quite expensive. Please let me know if there is some other reason for
it.

I also found some typos while going through the write-up in
zedstore_internal.h and thought of correcting them. Attached is a
patch with the changes.

Thanks,
-- 
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/include/access/zedstore_internal.h b/src/include/access/zedstore_internal.h
index 5330c70..c2b726b 100644
--- a/src/include/access/zedstore_internal.h
+++ b/src/include/access/zedstore_internal.h
@@ -177,7 +177,7 @@ ZSBtreeInternalPageIsFull(Page page)
  * Attribute B-tree leaf page layout
  *
  * Leaf pages in the attribute trees don't follow the normal page layout
- * with line pointers and items. They use the standard page page header,
+ * with line pointers and items. They use the standard page header,
  * with pd_lower and pd_upper, but the data stored in the lower and upper
  * parts are different from the normal usage.
  *
@@ -226,7 +226,7 @@ ZSBtreeInternalPageIsFull(Page page)
  * replacing both with one larger compressed stream.
  *
  * The names "lower" and "upper" refer to the physical location of
- * the stream on the page. The data in the in the lower attstream
+ * the stream on the page. The data in the lower attstream
  * have higher-numbered TIDs than the data in the upper attstream.
  * No overlap is allowed. This works well with the usual usage
  * pattern that new data is added to the end (i.e. with increasing
@@ -255,7 +255,7 @@ ZSBtreeInternalPageIsFull(Page page)
  * fits, and second time to compress just what fits. But that would be twice
  * as slow. In practice, the wasted space doesn't matter much. We try to
  * keep each chunk relatively small, to minimize the waste. And because we
- * now the next chunk wouldn't fit on the page anyway, there isn't much else
+ * know the next chunk wouldn't fit on the page anyway, there isn't much else
  * we could do with the wasted space, anyway.
  */
 typedef struct
