Hi,

This isn’t the thread (yet!) to get into this level of detail, but I do have some thoughts.

The two uses of sha256 here seem inappropriate to me. Users will typically choose short, readable names for both user_name and db_name, and this would force long, random-looking strings on them, which reduces simplicity and increases key size, the opposite of what we want to do.

Instead, I think we should enforce a limit of a few hundred characters on each item. If a user really can’t work within that constraint they can run the name through a message digest algorithm and deal with the fallout of that obfuscation themselves. Users that can name a database succinctly would not be penalised.

I do agree on the {NS} piece. We should not assume that we’re the only application inside the FoundationDB database. Indeed the FoundationDB documentation regards this as a best practice (https://apple.github.io/foundationdb/api-python.html#subspaces: “As a best practice, API clients should use at least one subspace for application data.”).

B.

> On 24 Jan 2019, at 20:16, Ilya Khlopotov <iil...@apache.org> wrote:
>
> First I apologize if you receive it twice (slightly different versions as
> well). It looks like my email is misconfigured since reply to
> dev@couchdb.apache.org from mail client didn't go through.
>
>> This eliminates the "length" of the path string concern and keeps every
>> document field a straight three entry path:
>> docid.revisionid.fieldid => [removed?, value]
>
> Michael this is a very good idea. I was working on proposal to use something
> like the following:
>
> * {NS} / sha256(user_name) / sha256(db_name) / index / by_seq / {update_seq}
> * {NS} / sha256(user_name) / sha256(db_name) / index / by_vsn / {vsn}
> * {NS} / sha256(user_name) / sha256(db_name) / data / docs / idx_by_docid / {docid}
> * {NS} / sha256(user_name) / sha256(db_name) / data / docs / {doc_idx} / content / {vsn} / body / {json_path} / {page_idx}
>
> Here:
> - {NS} is configurable namespace dedicated to CouchDB on FDB cluster.
> - {vsn} is FDB versionstamp
> - {page_idx} is separate path to represent scalar JSON values which exceed
>   FDB limitations on value size
> - {docid} - document id
> - {doc_idx} - arbitrary value to save different revisions of the document. We
>   add a level of indirection since we don't want to use {rev}. Because we might
>   insert documents during _bulk operations. In this case inserted but not yet
>   committed revisions of documents shouldn't be in list of available revisions.
>
> In the above model I couldn't figure out yet how to compress json_path.
>
> I'll send what I have so far into separate thread (when it would be started).
>
> Best regards,
> iilyak
>
>
> On 2019/01/24 12:46:14, Michael Fair <mich...@daclubhouse.net> wrote:
>> On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <rnew...@apache.org>
>> wrote:
>>
>>>
>>> We’d expand each document into a series of key-value pairs, where the key
>>> is the full path into the object and the value is the scalar value. E.g,
>>>
>>> {"foo": 12, "bar": {"baz": 13}}
>>>
>>> Would be
>>>
>>> foo => 12
>>> bar.baz => 13
>>
>>
>> I realize this quickly belongs in its own thread for later discussion, but
>> I wanted to point out/ask that by "interning the path strings" or using
>> some kind of deterministic hash algorithm, like SHA256 (or something
>> faster), on the "key path", couldn't you turn all variable-length strings
>> paths into a fixed size, integer type, field id?
>>
>> This eliminates the "length" of the path string concern and keeps every
>> document field a straight three entry path:
>> docid.revisionid.fieldid => [removed?, value]
>>
>> where:
>> * docid is the unique document identifier
>> * revisionid is obvious
>> * fieldid is the id of the path string (if a deterministic hash is used,
>>   it's computed; if indexed, it's looked up/retrieved)
>>
>> This idea assumes that the "path.string" <-> fieldid correlation is also
>> managed by interning those strings somewhere.
>>
>> By adding the removed bit flag, a document becomes simply the aggregation
>> of all the latest revisionids for each distinct fieldid lower than the
>> revisionid requested; eliminating all duplicate storage requirements for
>> non-changing fields.
>>
>> When a document update comes in, it breaks the document down into its
>> constituent fields, and only needs to add an entry if the state of a field
>> has somehow changed from its previous revision.
>>
>> It seems like this whole idea might be optimally and transparently handled
>> directly inside FDB if FDB was aware of this revisionid "idea". I'm of
>> course not sure which system is expected to handle the described document
>> deconstruction.
>>
>>
>> ======
>> This "fieldid hash" idea is also related to how the IPLD project creates
>> "pointers" to JSON documents inside its distributed p2p system to
>> hierarchically link portions of different documents together.
>>
>> Since a particular docid.revisionid represents a fixed point/state of a
>> document in the database, they use that reference as the "value" of a
>> special JSON Object that wants to "include"/"point to" the referenced
>> document.
>> The special JSON Object they used to create a "document link" looks like
>> this: {"/": "documenthashid"}
>>
>> The uploading document must explicitly put that reference in its own
>> document where it wants the system to link in the referenced document.
>> This hijacks this form of a JSON Object for this specific purpose and
>> prevents all higher level applications of IPLD from using it for any other
>> purpose.
>>
>> If desirable, the equivalent idea for CouchDB might be:
>> {"_/": "docid.revisionid.fieldid"}
>>
>> ======
>>
>> I'm not saying any of this is a good idea, simply that (1) the string
>> length concerns could be eliminated by using interned strings (which likely
>> would also improve performance); and (2) this field level storage in FDB
>> could enable a basis for adding "document pointers" which I'm sure many
>> people would appreciate.
>>
>>
>> Mike
>>
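
To make the options discussed in this thread easier to compare, here is a rough Python sketch. It is an illustration only, not an agreed design. It uses plain tuples rather than the FoundationDB client, and every name in it (MAX_NAME_LEN, check_name, flatten, field_id, doc_keys) is hypothetical, as is the exact limit of 256 characters and the shortened key prefix, which omits most levels of the layout proposed above. It shows readable, length-limited names in the key prefix, the expansion of a document into path => value pairs, and the alternative of a fixed-size fieldid computed by hashing the path.

```python
import hashlib
import json
from typing import Any, Iterator, Tuple

# Assumption: stands in for "a limit of a few hundred characters"; the thread
# does not settle on an exact number.
MAX_NAME_LEN = 256

Path = Tuple[str, ...]


def check_name(name: str) -> str:
    """Accept a readable name as-is, rejecting only empty or over-long ones."""
    if not 0 < len(name) <= MAX_NAME_LEN:
        raise ValueError(f"name must be 1..{MAX_NAME_LEN} characters")
    return name


def flatten(doc: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, scalar) pairs, e.g. ("bar", "baz") -> 13.

    Only nested objects are expanded. Arrays are yielded whole and empty
    objects yield nothing; a real design would have to handle both.
    """
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, path + (key,))
    else:
        yield path, doc


def field_id(path: Path) -> bytes:
    """Fixed-size (32-byte) id for a path: SHA-256 of its JSON encoding."""
    return hashlib.sha256(json.dumps(list(path)).encode("utf-8")).digest()


def doc_keys(ns: str, user_name: str, db_name: str, docid: str,
             doc: dict) -> Iterator[Tuple[tuple, Any]]:
    """Yield (key_tuple, value) pairs under a simplified, hypothetical prefix."""
    prefix = (ns, check_name(user_name), check_name(db_name),
              "data", "docs", docid)
    for path, value in flatten(doc):
        yield prefix + path, value
```

Calling doc_keys("couchdb", "alice", "mydb", "doc1", {"foo": 12, "bar": {"baz": 13}}) yields one key tuple per scalar, with the user and database names still readable in the key. Replacing the trailing path elements with field_id(path) gives the fixed-length variant, at the cost of needing a separate path <-> fieldid mapping if the original path must be recovered.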