Re: [DISCUSS] Rebase CouchDB on top of FoundationDB

Robert Samuel Newson Fri, 25 Jan 2019 00:59:20 -0800

Hi,

This isn’t the thread to get into this level of detail (yet!), but I 
do have some thoughts.


The two uses of sha256 here seem inappropriate to me. Users will typically 
choose short, readable names for both user_name and db_name, and this would 
force long, random looking strings on them, which reduces simplicity and 
increases key size, the opposite of what we want to do.

Instead, I think we should enforce a limit of a few hundred characters on each 
item. If a user really can’t work within that constraint, they can run the name 
through a message digest algorithm and deal with the fallout of that 
obfuscation themselves. Users who can name a database succinctly would not be 
penalised.
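
To make the key-size point concrete, here is a minimal sketch comparing the two
approaches. MAX_NAME_LENGTH and both function names are hypothetical; the only
figure given above is "a few hundred characters".

```python
import hashlib

# Hypothetical limit; the message only says "a few hundred characters".
MAX_NAME_LENGTH = 256


def hashed_component(name: str) -> str:
    """Key component under the sha256 proposal.

    Always 64 hex characters (32 bytes if the raw digest were stored),
    no matter how short the original name is.
    """
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def plain_component(name: str) -> str:
    """Key component under the length-limit proposal: the name itself."""
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name exceeds {MAX_NAME_LENGTH} characters")
    return name


for name in ("alice", "orders"):
    # Short names stay short and readable; the digest is fixed-size and opaque.
    print(name, len(plain_component(name)), len(hashed_component(name)))
```

A user with an over-long name could still call something like
hashed_component() on their own side before creating the database.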

I do agree on the {NS} piece. We should not assume that we’re the only 
application inside the FoundationDB database. Indeed, the FoundationDB 
documentation regards this as a best practice 
(https://apple.github.io/foundationdb/api-python.html#subspaces): "As a best 
practice, API clients should use at least one subspace for application data."
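
As a rough illustration of why the prefix matters, the sketch below models keys
as plain Python tuples in a dict rather than using the FoundationDB client. The
namespace value and helper names are made up for the example; in the real
client the subspace API described at the link above plays this role.

```python
# Hypothetical namespace prefix standing in for {NS}.
NS = ("couchdb",)


def ns_key(*parts):
    """Build a key that lives under the CouchDB namespace."""
    return NS + parts


def in_namespace(key):
    """True if the key belongs to our namespace."""
    return key[:len(NS)] == NS


# A shared key space: one of our keys next to another application's key.
store = {
    ns_key("alice", "orders", "index", "by_seq", 1): "doc-1",
    ("other_app", "settings"): "unrelated",
}

# Scanning by prefix only ever sees our own data.
ours = {k: v for k, v in store.items() if in_namespace(k)}
print(ours)
```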

B.

> On 24 Jan 2019, at 20:16, Ilya Khlopotov <iil...@apache.org> wrote:
> 
> First, I apologize if you receive this twice (in slightly different versions 
> as well). It looks like my email is misconfigured, since my reply to 
> dev@couchdb.apache.org from my mail client didn't go through. 
> 
>> This eliminates the "length" of the path string concern and keeps every
>> document field a straight three entry path:
>> docid.revisionid.fieldid => [removed?, value]
> 
> Michael, this is a very good idea. I was working on a proposal to use 
> something like the following:
> 
> * {NS} / sha256(user_name) / sha256(db_name) / index / by_seq / {update_seq}
> * {NS} / sha256(user_name) / sha256(db_name) / index / by_vsn / {vsn}
> * {NS} / sha256(user_name) / sha256(db_name) / data / docs / idx_by_docid / {docid}
> * {NS} / sha256(user_name) / sha256(db_name) / data / docs / {doc_idx} / content / {vsn} / body / {json_path} / {page_idx}
> 
> Here:
> - {NS} is a configurable namespace dedicated to CouchDB on the FDB cluster.
> - {vsn} is an FDB versionstamp.
> - {page_idx} is a separate path element used to represent scalar JSON values 
> which exceed FDB limitations on value size.
> - {docid} is the document id.
> - {doc_idx} is an arbitrary value used to save different revisions of the 
> document. We add a level of indirection since we don't want to use {rev}, 
> because we might insert documents during _bulk operations. In this case, 
> inserted but not yet committed revisions of documents shouldn't be in the 
> list of available revisions.
> 
> In the above model I haven't yet figured out how to compress json_path. 
> 
> I'll send what I have so far to a separate thread (once it is started).
> 
> Best regards,
> iilyak
> 
> 
> On 2019/01/24 12:46:14, Michael Fair <mich...@daclubhouse.net> wrote: 
>> On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <rnew...@apache.org>
>> wrote:
>> 
>>> 
>>> We’d expand each document into a series of key-value pairs, where the key
>>> is the full path into the object and the value is the scalar value. E.g,
>>> 
>>> {"foo": 12, "bar": {"baz": 13}}
>>> 
>>> Would be
>>> 
>>> foo => 12
>>> bar.baz => 13
>> 
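
The expansion quoted above can be sketched in a few lines. This is only an
illustration: the function name is hypothetical, arrays are treated as opaque
values, and joining with "." is ambiguous if a field name itself contains a
dot.

```python
def flatten(doc, prefix=()):
    """Yield (path, value) pairs for every non-object value in a nested dict."""
    for key, value in doc.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Descend into nested objects, extending the path.
            yield from flatten(value, path)
        else:
            # Scalars (and, in this sketch, lists) become one entry each.
            yield ".".join(path), value


doc = {"foo": 12, "bar": {"baz": 13}}
pairs = dict(flatten(doc))
print(pairs)
```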
>> 
>> I realize this quickly belongs in its own thread for later discussion, but
>> I wanted to point out/ask that by "interning the path strings" or using
>> some kind of deterministic hash algorithm, like SHA256 (or something
>> faster), on the "key path", couldn't you turn all variable-length string
>> paths into a fixed-size, integer-type field id?
>> 
>> This eliminates the "length" of the path string concern and keeps every
>> document field a straight three entry path:
>> docid.revisionid.fieldid => [removed?, value]
>> 
>> where:
>> * docid is the unique document identifier
>> * revisionid is obvious
>> * fieldid is the id of the path string (if a deterministic hash is used,
>> it's computed; if indexed, it's looked up/retrieved)
>> 
>> This idea assumes that the "path.string" <-> fieldid correlation is also
>> managed by interning those strings somewhere.
>> 
>> By adding the removed bit flag, a document becomes simply the aggregation
>> of all the latest revisionids for each distinct fieldid lower than the
>> revisionid requested; eliminating all duplicate storage requirements for
>> non-changing fields.
>> 
>> When a document update comes in, it breaks the document down into its
>> constituent fields, and only needs to add an entry if the state of a field
>> has somehow changed from its previous revision.
>> 
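
A minimal in-memory sketch of the field-level scheme described above. Several
things here are assumptions made for illustration only: revision ids are plain
increasing integers, "lower than the revisionid requested" is read as "at or
below", field ids come from a simple interning table rather than a hash, and
all names are hypothetical.

```python
store = {}      # (docid, revisionid, fieldid) -> (removed, value)
field_ids = {}  # interned path string -> integer fieldid
_MISSING = object()


def intern_path(path):
    """Return the integer fieldid for a path string, assigning one if new."""
    return field_ids.setdefault(path, len(field_ids))


def read(docid, rev):
    """Aggregate the latest entry per field at or below the requested rev."""
    latest = {}
    for (d, r, f), entry in sorted(store.items()):
        if d == docid and r <= rev:
            latest[f] = entry  # later revisions overwrite earlier ones
    paths = {f: p for p, f in field_ids.items()}
    return {paths[f]: v for f, (removed, v) in latest.items() if not removed}


def write(docid, rev, fields):
    """Store only the fields whose state changed since the previous revision."""
    previous = read(docid, rev - 1)
    for path, value in fields.items():
        if previous.get(path, _MISSING) != value:
            store[(docid, rev, intern_path(path))] = (False, value)
    for path in previous.keys() - fields.keys():
        # Field disappeared: record a tombstone via the removed flag.
        store[(docid, rev, intern_path(path))] = (True, None)


write("doc1", 1, {"foo": 12, "bar.baz": 13})
write("doc1", 2, {"foo": 12, "bar.baz": 14})  # only bar.baz is written
write("doc1", 3, {"foo": 12})                 # bar.baz is marked removed
print(len(store), read("doc1", 2), read("doc1", 3))
```

Three revisions of a two-field document produce four stored entries here,
since the unchanged "foo" field is written only once.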
>> It seems like this whole idea might be optimally and transparently handled
>> directly inside FDB if FDB was aware of this revisionid "idea".  I'm of
>> course not sure which system is expected to handle the described document
>> deconstruction.
>> 
>> 
>> ======
>> This "fieldid hash" idea is also related to how the IPLD project creates
>> "pointers" to JSON documents inside its distributed p2p system to
>> hierarchically link portions of different documents together.
>> 
>> Since a particular docid.revisionid represents a fixed point/state of a
>> document in the database, they use that reference as the "value" of a
>> special JSON Object that wants to "include"/"point to" the referenced
>> document.
>> The special JSON Object they used to create a "document link" looks like
>> this: {"/": "documenthashid"}
>> 
>> The uploading document must explicitly put that reference in its own
>> document where it wants the system to link in the referenced document.
>> This hijacks this form of a JSON Object for this specific purpose and
>> prevents all higher level applications of IPLD from using it for any other
>> purpose.
>> 
>> If desirable, the equivalent idea for CouchDB might be: {"_/":
>> "docid.revisionid.fieldid"}
>> 
>> ======
>> 
>> I'm not saying any of this is a good idea, simply that (1) the string
>> length concerns could be eliminated by using interned strings (which likely
>> would also improve performance); and (2) this field level storage in FDB
>> could enable a basis for adding "document pointers" which I'm sure many
>> people would appreciate.
>> 
>> 
>> Mike
>>
