[kudu-CR] schema: add is deleted virtual column

Dan Burkert (Code Review) Fri, 20 Jul 2018 09:41:33 -0700

Dan Burkert has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10968 )


Change subject: schema: add is_deleted virtual column
......................................................................


Patch Set 6:

I had a long discussion about this offline with Adar.  As I understand it, 
'virtual columns' are a generalized feature meant to allow querying the 
'is-deleted' status of a row without having to plumb through special handling 
in many layers of the stack (client, RPC, tserver).  Eventually this feature 
could be extended to more per-row metadata.  I'm sympathetic to this 
motivation, but nonetheless I don't come down in favor of this feature.

First, if looking at 'is-deleted' in isolation, I think it would be better to 
have an optional bitset that gets plumbed through which holds the deleted 
status.  As I understand it a boolean flag indicating whether to retain deleted 
rows is already going to have to get plumbed through, so this wouldn't be a big 
complication.  As a bonus, bitsets are more efficient in terms of memory 
overhead (1 bit per row vs 1 byte per row).  Ultimately I think this will have 
a cleaner implementation than checking against known column values.  It's clear 
that in the long run incremental (CDC) scans should use a sparse row 
representation, in which case re-using the existing row-batch layout is out the 
window, anyway.

Second, it doesn't seem to me that a generalized virtual column feature would 
be useful for other metadata.  Here's a partial list of attributes that have 
been thrown around, and an explanation of why I don't think virtual columns are 
appropriate:

- per-row tablet server and per-row tablet information.  It would be preferable 
to expose this on the scan batch, it's not necessary to hold a copy of this per 
row, and it's not necessary to send it across the wire.

- last-updated-timestamp per-row. If I understand tablet internals correctly, 
this can't ever be provided with per-row granularity, since we dispose of 
per-row timestamps during compaction after the ancient history watermark.  I 
think the best we could do here is to provide it per scan batch.

- DRS ID / CFile ID / block (page) ID per row. I think this use-case is a real 
stretch, motivation wise.  I think there will never be much demand for this 
info, and what little demand there is would be better served by just providing 
aggregate counts of # of DRS's, # of CFiles, and # of blocks scanned per row 
batch.

To wrap up this already-too-long comment, I'd favor not introducing this 
abstraction unless we can identify a realistic second use-case for it.


--
To view, visit http://gerrit.cloudera.org:8080/10968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I56eb1d44ba8bfbd76d8bb794b8076b695782939e
Gerrit-Change-Number: 10968
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Comment-Date: Fri, 20 Jul 2018 16:41:14 +0000
Gerrit-HasComments: No

[kudu-CR] schema: add is deleted virtual column

Reply via email to