On Tue, Aug 27, 2019 at 12:03 AM Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
> My point is, once we find the leaf page containing the given tid, we go > through each item in the page until we find the data corresponding to the > given tid which means we kind of perform a sequential scan at the page > level. I'm referring to the below loop in zsbt_attr_scan_fetch_array(). > > for (off = FirstOffsetNumber; off <= maxoff; off++) > { > ItemId iid = PageGetItemId(page, off); > ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) > PageGetItem(page, iid); > > if (item->t_endtid <= nexttid) > continue; > > if (item->t_firsttid > nexttid) > break; > > But that's not true for IndexScan in case of heap table because there the > index tuple contains the exact physical location of tuple in the heap. So, > there is no need to scan the entire page. > You are correct that we currently go through each item in the leaf page that contains the given tid, specifically, the logic to retrieve all the attribute items inside a ZSAttStream is now moved to decode_attstream() in the latest code, and then in zsbt_attr_fetch() we again loop through each item we previously retrieved from decode_attstream() and look for the given tid. One optimization we can to is to tell decode_attstream() to stop decoding at the tid we are interested in. We can also apply other tricks to speed up the lookups in the page, for fixed length attribute, it is easy to do binary search instead of linear search, and for variable length attribute, we can probably try something that we didn't think of yet. 1) In zsundo_insert_finish(), there is a double call to > BufferGetPage(undobuf); Is that required ? > Fixed, thanks! 2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called > twice? I'm referring to the below code. > > if (fetch_proj->num_proj_atts == 0) > { > .... > .... > zsbt_tid_begin_scan(rel, tid, tid + 1, > snapshot, > &fetch_proj->tid_scan); > fetch_proj->tid_scan.serializable = true; > > for (int i = 1; i < fetch_proj->num_proj_atts; i++) > { > int attno = fetch_proj->proj_atts[i]; > > zsbt_attr_begin_scan(rel, reldesc, attno, > &fetch_proj->attr_scans[i - 1]); > } > MemoryContextSwitchTo(oldcontext); > > zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, > &fetch_proj->tid_scan); > } > I removed the second call, thanks! > Also, for all types of update operation (be it key or non-key update) we > create a new tid for the new version of tuple. Can't we use the tid > associated with the old tuple for the cases where there is no concurrent > transactions to whom the old tuple is still visible. > Zedstore currently implement update as delete+insert, hence the old tid is not reused. We don't store the tuple in our UNDO log, and we only store the transaction information in the UNDO log. Reusing the tid of the old tuple means putting the old tuple in the UNDO log, which we have not implemented yet. Thanks for reporting, this is very helpful! Patches are welcome as well!