First, I apologize if you receive this twice (possibly in slightly different versions). It looks like my email is misconfigured, since my reply to dev@couchdb.apache.org from my mail client did not go through.
> This eliminates the "length" of the path string concern and keeps every
> document field a straight three entry path:
> docid.revisionid.fieldid => [removed?, value]

Michael, this is a very good idea. I was working on a proposal to use something like the following:

* {NS} / sha256(user_name) / sha256(db_name) / index / by_seq / {update_seq}
* {NS} / sha256(user_name) / sha256(db_name) / index / by_vsn / {vsn}
* {NS} / sha256(user_name) / sha256(db_name) / data / docs / idx_by_docid / {docid}
* {NS} / sha256(user_name) / sha256(db_name) / data / docs / {doc_idx} / content / {vsn} / body / {json_path} / {page_idx}

Here:
- {NS} is a configurable namespace dedicated to CouchDB on the FDB cluster.
- {vsn} is an FDB versionstamp.
- {page_idx} is a separate path component used to represent scalar JSON values that exceed the FDB limit on value size.
- {docid} is the document id.
- {doc_idx} is an arbitrary value used to store different revisions of the document. We add a level of indirection because we don't want to use {rev}: we might insert documents during _bulk operations, and in that case revisions that are inserted but not yet committed shouldn't appear in the list of available revisions.

In the above model I haven't yet figured out how to compress json_path. I'll send what I have so far in a separate thread (once it is started).

Best regards,
iilyak

On 2019/01/24 12:46:14, Michael Fair <mich...@daclubhouse.net> wrote:
> On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <rnew...@apache.org>
> wrote:
>
> >
> > We’d expand each document into a series of key-value pairs, where the key
> > is the full path into the object and the value is the scalar value.
> > E.g.,
> >
> > {"foo": 12, "bar": {"baz": 13}}
> >
> > Would be
> >
> > foo => 12
> > bar.baz => 13
> >
>
> I realize this quickly belongs in its own thread for later discussion, but
> I wanted to point out/ask that by "interning the path strings" or using
> some kind of deterministic hash algorithm, like SHA256 (or something
> faster), on the "key path", couldn't you turn all variable-length strings
> paths into a fixed size, integer type, field id?
>
> This eliminates the "length" of the path string concern and keeps every
> document field a straight three entry path:
> docid.revisionid.fieldid => [removed?, value]
>
> where:
> * docid is the unique document identifier
> * revisionid is obvious
> * fieldid is the id of the path string (if a deterministic hash is used,
> it's computed; if indexed, it's looked up/retrieved)
>
> This idea assumes that the "path.string" <-> fieldid correlation is also
> managed by interning those strings somewhere.
>
> By adding the removed bit flag, a document becomes simply the aggregation
> of all the latest revisionids for each distinct fieldid lower than the
> revisionid requested; eliminating all duplicate storage requirements for
> non-changing fields.
>
> When a document update comes in, it breaks the document down into its
> constituent fields, and only needs to add an entry if the state of a field
> has somehow changed from its previous revision.
>
> It seems like this whole idea might be optimally and transparently handled
> directly inside FDB if FDB was aware of this revisionid "idea". I'm of
> course not sure which system is expected to handle the described document
> deconstruction.
>
>
> ======
> This "fieldid hash" idea is also related to how the IPLD project creates
> "pointers" to JSON documents inside its distributed p2p system to
> hierarchically link portions of different documents together.
>
> Since a particular docid.revisionid represents a fixed point/state of a
> document in the database, they use that reference as the "value" of a
> special JSON Object that wants to "include"/"point to" the referenced
> document.
> The special JSON Object they used to create a "document link" looks like
> this: {"/": "documenthashid"}
>
> The uploading document must explicitly put that reference in its own
> document where it wants the system to link in the referenced document.
> This hijacks this form of a JSON Object for this specific purpose and
> prevents all higher level applications of IPLD from using it for any other
> purpose.
>
> If desirable, the equivalent idea for CouchDB might be:
> {"_/": "docid.revisionid.fieldid"}
>
> ======
>
> I'm not saying any of this is a good idea, simply that (1) the string
> length concerns could be eliminated by using interned strings (which likely
> would also improve performance); and (2) this field level storage in FDB
> could enable a basis for adding "document pointers" which I'm sure many
> people would appreciate.
>
>
> Mike
>
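
To make the quoted idea concrete, here is a rough Python sketch of the field-level layout: flatten a document into path => value pairs, turn each path into a fixed-size fieldid, and store (docid, revisionid, fieldid) => [removed?, value]. Everything in it is an assumption for illustration only. A plain dict stands in for FDB, revisions are linear integers rather than CouchDB's revision tree, fieldid is the first 8 bytes of SHA-256 over the JSON-encoded path, arrays are kept as single values, and hash collisions are ignored. The names flatten, field_id, latest_fields, write_revision and read_revision are hypothetical.

```python
import hashlib
import json


def flatten(doc, prefix=()):
    """Yield (path, value) for every leaf; non-empty objects are descended into."""
    for key, value in doc.items():
        path = prefix + (key,)
        if isinstance(value, dict) and value:
            yield from flatten(value, path)
        else:
            # Scalars, arrays and empty objects are stored as one value (assumption).
            yield path, value


def field_id(path):
    """Fixed-size id for a path: SHA-256 truncated to 8 bytes (assumed size)."""
    return hashlib.sha256(json.dumps(list(path)).encode("utf-8")).digest()[:8]


def latest_fields(store, docid, rev):
    """Map fieldid -> value using, per field, the newest entry with revision <= rev."""
    latest = {}
    for (d, r, fid), entry in store.items():
        if d == docid and r <= rev and (fid not in latest or r > latest[fid][0]):
            latest[fid] = (r, entry)
    # Drop fields whose newest entry carries the "removed" flag.
    return {fid: value for fid, (_, (removed, value)) in latest.items() if not removed}


def write_revision(store, interned, docid, rev, doc):
    """Add entries only for fields that changed relative to the previous revision."""
    before = latest_fields(store, docid, rev - 1)
    after = {}
    for path, value in flatten(doc):
        fid = field_id(path)
        interned[fid] = path  # the "path string" <-> fieldid table
        after[fid] = value
    for fid, value in after.items():
        if fid not in before or before[fid] != value:
            store[(docid, rev, fid)] = (False, value)
    for fid in before.keys() - after.keys():
        store[(docid, rev, fid)] = (True, None)  # field was removed in this revision


def read_revision(store, interned, docid, rev):
    """Rebuild the nested document as of the given revision."""
    doc = {}
    for fid, value in latest_fields(store, docid, rev).items():
        *parents, leaf = interned[fid]
        node = doc
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return doc


store, interned = {}, {}
write_revision(store, interned, "doc1", 1, {"foo": 12, "bar": {"baz": 13}})
write_revision(store, interned, "doc1", 2, {"foo": 12, "bar": {"baz": 14}})
print(len(store))  # 3: "foo" is stored once, "bar.baz" twice
print(read_revision(store, interned, "doc1", 1))
print(read_revision(store, interned, "doc1", 2))
```

The second write adds a single entry because only bar.baz changed, which is the storage saving described in the quoted text. The linear scan in latest_fields is only there to keep the sketch short; against an ordered store this would presumably be a range read per document.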