Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/10968 )
Change subject: schema: add is_deleted virtual column ...................................................................... Patch Set 6: I had a long discussion about this offline with Adar. As I understand it, 'virtual columns' are a generalized feature meant to allow querying the 'is-deleted' status of a row without having to plumb through special handling in many layers of the stack (client, RPC, tserver). Eventually this feature could be extended to more per-row metadata. I'm sympathetic to this motivation, but nonetheless I don't come down in favor of this feature. First, if looking at 'is-deleted' in isolation, I think it would be better to have an optional bitset that gets plumbed through which holds the deleted status. As I understand it a boolean flag indicating whether to retain deleted rows is already going to have to get plumbed through, so this wouldn't be a big complication. As a bonus, bitsets are more efficient in terms of memory overhead (1 bit per row vs 1 byte per row). Ultimately I think this will have a cleaner implementation than checking against known column values. It's clear that in the long run incremental (CDC) scans should use a sparse row representation, in which case re-using the existing row-batch layout is out the window, anyway. Second, it doesn't seem to me that a generalized virtual column feature would be useful for other metadata. Here's a partial list of attributes that have been thrown around, and an explanation of why I don't think virtual columns are appropriate: - per-row tablet server and per-row tablet information. It would be preferable to expose this on the scan batch, it's not necessary to hold a copy of this per row, and it's not necessary to send it across the wire. - last-updated-timestamp per-row. If I understand tablet internals correctly, this can't ever be provided with per-row granularity, since we dispose of per-row timestamps during compaction after the ancient history watermark. I think the best we could do here is to provide it per scan batch. - DRS ID / CFile ID / block (page) ID per row. I think this use-case is a real stretch, motivation wise. I think there will never be much demand for this info, and what little demand there is would be better served by just providing aggregate counts of # of DRS's, # of CFiles, and # of blocks scanned per row batch. To wrap up this already-too-long comment, I'd favor not introducing this abstraction unless we can identify a realistic second use-case for it. -- To view, visit http://gerrit.cloudera.org:8080/10968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I56eb1d44ba8bfbd76d8bb794b8076b695782939e Gerrit-Change-Number: 10968 Gerrit-PatchSet: 6 Gerrit-Owner: Adar Dembo <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Fri, 20 Jul 2018 16:41:14 +0000 Gerrit-HasComments: No
