On Thu, Feb 7, 2019 at 9:35 PM Adam Kocoloski <kocol...@apache.org> wrote:
>
> Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a bit 
> “simplistic” to exclusively choose simplicity over performance and storage 
> density. We’re (re)building a database here, one that has some users with 
> pretty demanding performance and scalability requirements. And yes, we should 
> certainly be testing and measuring. Kyle and team are setting up 
> infrastructure in IBM land to help with that now, but I also believe we can 
> design the data model and architecture with a basic performance model of 
> FoundationDB in mind:
>
> - reads cost 1ms
> - short range reads are the same cost as a single lookup
> - reads of independent parts of the keyspace can be parallelized for cheap
> - writes are zero-cost until commit time
>
> We ought to be able to use these assumptions to drive some decisions about 
> data models ahead of any end-to-end performance test.
>
> If there are specific elements of the edit conflicts management where you 
> think greater simplicity is warranted, let’s get those called out. Ilya noted 
> (correctly, in my opinion) that the term sharing stuff is one of those items. 
> It’s relatively complex, potentially a performance hit, and only saves on 
> storage density in the corner case of lots of edit conflicts. That’s a good 
> one to drop.
>
> I’m relatively happy with the revision history data model at this point. 
> Hopefully folks find it easy to grok, and it’s efficient for both reads and 
> writes. It costs some extra storage for conflict revisions compared to the 
> current tree representation (up to 16K per edit branch, with default 
> _revs_limit) but knowing what we know about the performance death spiral for 
> wide revision trees today I’ll happily make a storage vs. performance 
> tradeoff here :)
>
> Setting the shared term approach aside, I’ve still been mulling over the key 
> structure for the actual document data:
>
> - I thought about trying to construct a special _conflicts subspace, but I 
> don’t like that approach because the choice of a “winning” revision can flip 
> back and forth very quickly with concurrent writers to different edit 
> branches. I think we really want to have a way for revisions to naturally 
> sort themselves so the winner is the first or last revision in a list.
>
> - Assuming we’re using key paths of the form (docid, revision-ish, path, to, 
> field), the goal here is to find an efficient way to get the last key with 
> prefix “docid” (assuming winner sorts last), and then all the keys that share 
> the same (docid, revision-ish) prefix as that one. I see two possible 
> approaches so far, neither perfect:
>
> Option 1: Execute a get_key() operation with a key selector that asks for the 
> last key less than “docid\xFF” (again assuming winner sorts last), and then 
> do a get_range_startswith() request setting the streaming mode to “want_all” 
> and the prefix to the docid plus whatever revision-ish we found from the 
> get_key() request. This is two roundtrips instead of one, but it always 
> retrieves exactly the right set of keys, and the second step is executed as 
> fast as possible.
>
> Option 2: Jump straight to get_range_startswith() request using only “docid” 
> as the prefix, then cancel the iteration once we reach a revision not equal 
> to the first one we see. We might transfer too much data, or we might end up 
> doing multiple roundtrips if the default “iterator” streaming mode sends too 
> little data to start (I haven’t checked what the default iteration block is 
> there), but in the typical case of zero edit conflicts we have a good chance 
> of retrieving the full document in one roundtrip.
>

I'm working through getting the fdb bindings written and fully tested, so
I have a couple of notes on the range streaming bits.

The first somewhat surprising bit was that the want_all streaming mode
doesn't actually return all rows in a single call. If the range is
large enough, the client has to invoke get_range a number of times,
tweaking parameters each time to know when to stop. I just ran into that
last night, but discovered the logic isn't terribly complicated. The
Python implementation is at [1] for reference.
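
Roughly, the loop looks like this. This is just a minimal sketch of the
logic in [1], not the actual binding code; `drain_range` and
`get_range_chunk` are names I made up, and the (kvs, more) return shape is
an assumption about what the low-level call hands back:

    import fdb
    fdb.api_version(600)

    def drain_range(get_range_chunk, begin_key, end_key):
        """Keep issuing range reads until the server says there is no more.

        `get_range_chunk(begin_sel, end_sel, iteration)` is a stand-in for
        the low-level binding call; assumed to return (kvs, more) where
        `more` indicates the batch was truncated.
        """
        begin = fdb.KeySelector.first_greater_or_equal(begin_key)
        end = fdb.KeySelector.first_greater_or_equal(end_key)
        iteration = 1
        out = []
        while True:
            kvs, more = get_range_chunk(begin, end, iteration)
            out.extend(kvs)
            if not more or not kvs:
                return out
            # Resume just past the last key of the previous batch.
            begin = fdb.KeySelector.first_greater_than(kvs[-1].key)
            iteration += 1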

> I don’t have a good sense of which option wins out here from a performance 
> perspective, but they’re both operating on the same data model so easy enough 
> to test the alternatives. The important bit is getting the revision-ish 
> things to sort correctly. I think we can do that by generating something like
>

There are a number of different streaming modes, and my wager is that
we'll have to play with them a bit to get an idea of which is best, and
likely which is best in which scenario. For instance, when we have a
defined range, want_all likely wins, but in the "iterate until next
revision found" situation I'd wager the `iterator` mode is the better
choice, as it appears to adaptively size row batches.
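
To make the comparison concrete, here's roughly how I picture the two
options with the Python bindings. This is an untested sketch with error
handling omitted; the `docs` subspace and the function names are mine, and
it assumes the (docid, revision-ish, path, to, field) layout described
above, with the winner sorting last:

    import fdb
    fdb.api_version(600)

    docs = fdb.Subspace(('docs',))  # hypothetical subspace for doc data

    @fdb.transactional
    def read_winner_option1(tr, docid):
        # Two roundtrips: find the last key under the docid prefix, then
        # range-read exactly that revision's keys with want_all.
        last_key = tr.get_key(
            fdb.KeySelector.last_less_than(docs.pack((docid,)) + b'\xff'))
        # No existence check here; a missing doc would need handling.
        revision_ish = docs.unpack(last_key.wait())[1]
        prefix = docs.pack((docid, revision_ish))
        return list(tr.get_range_startswith(
            prefix, streaming_mode=fdb.StreamingMode.want_all))

    @fdb.transactional
    def read_winner_option2(tr, docid):
        # Scan backwards from the end of the docid prefix and stop as soon
        # as we see a second revision-ish value.
        winner = None
        rows = []
        for kv in tr.get_range_startswith(
                docs.pack((docid,)), reverse=True,
                streaming_mode=fdb.StreamingMode.iterator):
            revision_ish = docs.unpack(kv.key)[1]
            if winner is None:
                winner = revision_ish
            elif revision_ish != winner:
                break
            rows.append(kv)
        rows.reverse()  # restore key order
        return rows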

> revision-ish = NotDeleted/1bit : RevPos : RevHash
>
> with some suitable order-preserving encoding on the RevPos integer.
>

The tuple layer already supports minimally sized order-preserving
encodings for integers [2], so reusing that is enough. Shrinking the
deleted flag into one bit may be a bit of a bother, as the tuple layer
encodes booleans as a single byte (though it's the single type byte,
not a type+value byte). A possibly simpler approach to shrinking that
would be to encode the RevPos as a negative number for all deleted
revisions. That seems legit, though it's definitely registering on
my "too cute" meter, so I may be missing a reason that'd break before
I'm properly caffeinated.
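
For illustration, here's the sort behaviour I mean with the Python tuple
layer (a quick sketch; the RevPos/RevHash values are made up):

    import fdb
    import fdb.tuple
    fdb.api_version(600)

    # The tuple layer's integer encoding is order-preserving and variable
    # length, and negative integers sort before non-negative ones. So
    # negating RevPos for deleted revisions makes every deleted
    # revision-ish sort before every live one, without spending a
    # separate byte on the flag.
    def revision_ish(rev_pos, rev_hash, deleted):
        pos = -rev_pos if deleted else rev_pos
        return fdb.tuple.pack((pos, rev_hash))

    keys = [
        revision_ish(3, b'aaa', deleted=True),   # deleted branch
        revision_ish(2, b'bbb', deleted=False),  # live conflict
        revision_ish(4, b'ccc', deleted=False),  # live winner
    ]

    # Sorting the packed keys puts deleted branches first and the highest
    # live RevPos last, which is what we want if the winner sorts last.
    for k in sorted(keys):
        print(fdb.tuple.unpack(k))

One wrinkle I do see: negating RevPos inverts the order among the deleted
revisions themselves, which may or may not matter when every branch is
deleted and we still have to pick a winner.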

Paul

[1] 
https://github.com/apple/foundationdb/blob/6.0.15/bindings/python/fdb/impl.py#L358-L395
[2] https://github.com/apple/foundationdb/blob/master/design/tuple.md#integer
