Mike Percy has posted comments on this change.

Change subject: Non-covering Range Partitions design doc
......................................................................
Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/2772/1/docs/design-docs/non-covering-range-partitions.md
File docs/design-docs/non-covering-range-partitions.md:

Line 87: RANGE BOUND (("North America"), ("North America\0")),
       : RANGE BOUND (("Europe"), ("Europe\0")),
       : RANGE BOUND (("Asia"), ("Asia\0"));
> This example suggests that it would be useful to express "point" ranges (i.

The examples shown are "exact match" partitions, and I wonder whether syntax that expresses this directly, such as DISTRIBUTE BY DISTINCT VALUE (region), would be more usable for them: effectively non-covering hash partitioning with an unlimited number of buckets, plus additional syntax like INITIAL PARTITIONS ("North America", "Europe", "Asia"). That said, the RANGE BOUND syntax would still be required for range-partitioned tables.

To nitpick on the range syntax for a second: maybe DISTRIBUTE BY RANGE (region) INITIAL PARTITIONS (("NA", "NA\0"), ("Asia", "Asia\0")) would be more readable. Just an idea.

Line 121: only
        : recontacting the master after a configurable timeout.
> Would it be sufficient to recontact the master in the event that a write yi

I'm trying to come up with a perfectly consistent way to avoid skipping valid partitions in a scan. It's tough. The scheme below isn't perfect, but it is probably better than a client-side timeout.

If our "alter table" operations really do alter the table schema, we could rev the schema version and mark it as a special "new partition" schema change. (We may also need to track the latest version that included a partition change.) After the master instantiates the new tablet replicas, it exposes the new partition to clients when they make metadata calls to the master. Once the schema change is visible, the master sends a schema change RPC to all tablets notifying them of it, along with the (propagated) timestamp at which the change became visible on the master; before that point, no one could have written data to the new partition.
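To make the moving parts concrete, here is a toy simulation of the scheme I have in mind. All class and method names are mine, not Kudu's actual client or server API, and the window where a TS has not yet applied the new schema is deliberately elided:

```python
# Toy simulation of the proposed "partition-change visibility" scheme.
# Hypothetical names; this is a sketch of the idea, not Kudu code.

class StaleSchemaError(Exception):
    """Raised by a tablet server when the client's schema is too old."""

class TabletServer:
    def __init__(self):
        self.schema_version = 1
        self.visible_at = 0  # visibility timestamp of latest partition change

    def apply_schema(self, version, visible_at):
        self.schema_version = version
        self.visible_at = visible_at

    def snapshot_scan(self, client_schema_version, snapshot_ts):
        # Reject a scan at a timestamp later than the newest schema's
        # visibility timestamp if the client's schema predates that schema.
        if (snapshot_ts > self.visible_at
                and client_schema_version < self.schema_version):
            raise StaleSchemaError()
        return "rows"

class Master:
    def __init__(self):
        self.schema_version = 1
        self.tablet_servers = []

    def add_partition(self, now):
        # Rev the schema version; record the (propagated) timestamp at
        # which the change became visible, then notify all tablets.
        # No write to the new partition can predate this timestamp.
        self.schema_version += 1
        for ts in self.tablet_servers:
            ts.apply_schema(self.schema_version, now)

class Client:
    def __init__(self, master):
        self.master = master
        self.schema_version = master.schema_version

    def scan(self, ts_server, snapshot_ts):
        try:
            return ts_server.snapshot_scan(self.schema_version, snapshot_ts)
        except StaleSchemaError:
            # Refresh metadata from the master and retry the scan.
            self.schema_version = self.master.schema_version
            return ts_server.snapshot_scan(self.schema_version, snapshot_ts)
```

Note that a scan at a snapshot timestamp *earlier* than the visibility timestamp succeeds even with a stale schema, since no data could have existed in the new partition at that time; that is what makes the rejection rule safe to scope by timestamp.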
At the tablet servers, if a client does a snapshot scan at a timestamp later than the latest schema's "visibility timestamp" while using an older schema that doesn't know about the partition change, each TS returns an error telling the client to refresh its metadata. The client does so from the master, automatically, and retries the scan.

Unfortunately, there is a window of time between when the new schema is made visible on the master and when the TSes apply it. Data could be written during that window, and snapshot scans run with an old schema would cause clients to skip entire partitions, resulting in what looks like temporarily lost writes. I think if we had a transaction manager that could run two-phase commit, we could do better.

--
To view, visit http://gerrit.cloudera.org:8080/2772
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3e530eda60c00faf066c41b6bdb2b37f6d96a5dc
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
