Mike Percy has posted comments on this change.

Change subject: Non-covering Range Partitions design doc
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/2772/1/docs/design-docs/non-covering-range-partitions.md
File docs/design-docs/non-covering-range-partitions.md:

Line 87:               RANGE BOUND (("North America"), ("North America\0")),
       :               RANGE BOUND (("Europe"), ("Europe\0")),
       :               RANGE BOUND (("Asia"), ("Asia\0"));
> This example suggests that it would be useful to express "point" ranges (i.
The examples shown are "exact match" partitions, and I wonder whether some 
syntax expressing that, such as DISTRIBUTE BY DISTINCT VALUE (region), 
would be more usable for those cases: essentially non-covering hash 
partitioning with an unlimited number of buckets. It could be paired with 
additional syntax like INITIAL PARTITIONS ("North America", "Europe", 
"Asia"). That said, the RANGE BOUND syntax would still be required for 
genuinely range-partitioned tables.

Just to nitpick on the range syntax for a second, maybe DISTRIBUTE BY RANGE 
(region) INITIAL PARTITIONS (("NA", "NA\0"), ("Asia", "Asia\0")) would be more 
readable. Just an idea.
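
To make the comparison concrete, here are the two spellings side by side. 
None of this syntax exists yet; both forms are speculative sketches:

```sql
-- Point-range form from the doc, using explicit bounds:
CREATE TABLE t ... DISTRIBUTE BY RANGE (region)
    RANGE BOUND (("North America"), ("North America\0")),
    RANGE BOUND (("Europe"), ("Europe\0")),
    RANGE BOUND (("Asia"), ("Asia\0"));

-- Hypothetical exact-match form suggested above:
CREATE TABLE t ... DISTRIBUTE BY DISTINCT VALUE (region)
    INITIAL PARTITIONS ("North America", "Europe", "Asia");
```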


Line 121: only
        : recontacting the master after a configurable timeout.
> Would it be sufficient to recontact the master in the event that a write yi
I'm trying to come up with a perfectly consistent way to avoid skipping valid 
partitions in a scan. It's tough. The scheme below isn't perfect, but it's 
probably better than a client-side timeout.

If our "alter table" operations really do alter the table schema, we could rev 
the schema version and mark it as a special "new partition" schema change. 
(We may also have to keep track of the latest version that included a 
partition change.) After the master instantiates the new tablet replicas, it 
exposes the addition of the new partition to clients when metadata calls are 
made to the master. Once this schema change is made visible, the master sends 
a schema change RPC to all tablets notifying them of it, along with the 
(propagated) timestamp of when it became visible on the master, since before 
then no one could have written data to that partition.
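
Roughly, the master-side bookkeeping I have in mind looks like this. All 
names here are invented for illustration; nothing below is a real Kudu API:

```python
# Hypothetical master-side state for the scheme above.
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    schema_version: int = 0
    # Newest schema version that added or dropped a range partition.
    last_partition_change_version: int = 0
    # Timestamp at which each partition-changing schema version became
    # visible to clients, keyed by schema version.
    visibility_ts: dict = field(default_factory=dict)

def add_partition(meta, now_ts):
    """Rev the schema and record when the new partition became visible."""
    meta.schema_version += 1
    meta.last_partition_change_version = meta.schema_version
    meta.visibility_ts[meta.schema_version] = now_ts
    return meta.schema_version
```

The visibility timestamp is what the master would propagate in the schema 
change RPC to the tablet servers.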

At the tablet servers, if a client does a snapshot scan at a timestamp later 
than the latest schema's "visibility timestamp" using an older schema that 
doesn't know about the partition change, each TS returns an error telling the 
client to refresh its metadata. The client does so from the master, 
automatically, and retries the scan.
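
The client-side half of that could be a simple refresh-and-retry loop. Again, 
this is a sketch with made-up names (StaleSchemaError, refresh_metadata), not 
real client API:

```python
class StaleSchemaError(Exception):
    """Raised by a TS when the client's cached schema is too old."""

def snapshot_scan(client, table, snapshot_ts, max_retries=3):
    """Run a snapshot scan, refreshing metadata on a stale-schema error."""
    for _ in range(max_retries):
        try:
            return client.scan(table, snapshot_ts)
        except StaleSchemaError:
            # The TS saw snapshot_ts at or past the newest schema's
            # visibility timestamp, but our cached schema predates the
            # partition change: refetch metadata and retry.
            client.refresh_metadata(table)
    raise RuntimeError("metadata still stale after %d retries" % max_retries)
```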

Unfortunately, there is some window of time between when the new schema is made 
visible on the master and when the TS's apply the new schema. Data could be 
written during that window, and snapshot scans run with an old schema would 
cause clients to skip entire partitions, resulting in what looks like 
temporarily lost writes.

I think if we had a transaction manager that could run two-phase commit, we 
could do better.


-- 
To view, visit http://gerrit.cloudera.org:8080/2772
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3e530eda60c00faf066c41b6bdb2b37f6d96a5dc
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
