First, I apologize if you receive this twice (in slightly different versions as 
well). It looks like my email is misconfigured, since my reply to 
dev@couchdb.apache.org from my mail client didn't go through. 

> This eliminates the "length" of the path string concern and keeps every
> document field a straight three entry path:
> docid.revisionid.fieldid => [removed?, value]
 
Michael, this is a very good idea. I was working on a proposal to use something 
like the following:

 * {NS} / sha256(user_name) / sha256(db_name) / index / by_seq / {update_seq}
 * {NS} / sha256(user_name) / sha256(db_name) / index / by_vsn / {vsn}
 * {NS} / sha256(user_name) / sha256(db_name) / data / docs / idx_by_docid / {docid}
 * {NS} / sha256(user_name) / sha256(db_name) / data / docs / {doc_idx} / content / {vsn} / body / {json_path} / {page_idx}

Here:
- {NS} is a configurable namespace dedicated to CouchDB on the FDB cluster.
- {vsn} is an FDB versionstamp.
- {page_idx} is a separate path component used to split scalar JSON values 
that exceed FDB's limit on value size.
- {docid} is the document id.
- {doc_idx} is an arbitrary value used to store different revisions of the 
document. It adds a level of indirection because we don't want to use {rev}: 
we might insert documents during _bulk operations, and in that case revisions 
that are inserted but not yet committed shouldn't appear in the list of 
available revisions.
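
To make the body layout above concrete, here is a rough sketch of how one 
document could be expanded into keys of the form 
{doc_idx} / content / {vsn} / body / {json_path} / {page_idx}. It uses plain 
Python tuples as stand-ins for FDB keys and does not call the FDB client at 
all. The helper names (flatten, body_keys), the JSON encoding of scalar values, 
and the "vsn-1" placeholder are assumptions made for illustration; a real 
versionstamp would be assigned by FDB at commit time. The default page size 
reflects FDB's documented 100,000-byte limit on values.

```python
import hashlib
import json

FDB_VALUE_LIMIT = 100_000  # FDB's documented maximum value size, in bytes


def sha256_hex(text):
    """Hex digest used for the user_name and db_name path components."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def flatten(value, path=()):
    """Yield (json_path, leaf) pairs; json_path is a tuple of keys/indexes."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from flatten(child, path + (key,))
    elif isinstance(value, list) and value:
        for index, child in enumerate(value):
            yield from flatten(child, path + (index,))
    else:
        # Scalars, plus empty objects/arrays, are stored as leaves.
        yield path, value


def body_keys(ns, user_name, db_name, doc_idx, vsn, doc,
              page_size=FDB_VALUE_LIMIT):
    """Yield (key_tuple, value_bytes) pairs for the body of one revision."""
    prefix = (ns, sha256_hex(user_name), sha256_hex(db_name),
              "data", "docs", doc_idx, "content", vsn, "body")
    for json_path, leaf in flatten(doc):
        encoded = json.dumps(leaf).encode("utf-8")
        # Split values larger than page_size across several {page_idx} keys.
        for page_idx, start in enumerate(range(0, len(encoded), page_size)):
            yield prefix + (json_path, page_idx), encoded[start:start + page_size]


if __name__ == "__main__":
    doc = {"foo": 12, "bar": {"baz": 13}}
    for key, value in body_keys("couchdb", "alice", "mydb", 7, "vsn-1", doc):
        print(key[-2:], value)
```

Reading a value back would mean a range read over the {json_path} prefix and 
concatenating the pages in {page_idx} order.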

In the above model I haven't yet figured out how to compress json_path. 
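
One possible direction, following Michael's interning suggestion quoted below, 
would be to replace json_path with a fixed-size id derived from a hash of the 
path, and to keep an id -> path table so the path can be recovered. The sketch 
below is only an illustration of that idea, not a worked-out proposal: the 
names (field_id, PathInterner), the 8-byte truncation, and the in-memory dict 
standing in for a persisted table are all assumptions.

```python
import hashlib
import json


def field_id(json_path, size=8):
    """Derive a fixed-size id from a path given as a tuple of keys/indexes."""
    # JSON-encode the segments so ("a", "b") and ("a.b",) hash differently.
    encoded = json.dumps(list(json_path), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).digest()[:size]


class PathInterner:
    """In-memory stand-in for a persisted fieldid <-> path table."""

    def __init__(self):
        self._path_by_id = {}

    def intern(self, json_path):
        path = tuple(json_path)
        fid = field_id(path)
        known = self._path_by_id.setdefault(fid, path)
        if known != path:
            # Truncated hashes can collide; a real design must handle this.
            raise ValueError("field id collision for %r and %r" % (known, path))
        return fid

    def lookup(self, fid):
        return self._path_by_id[fid]


if __name__ == "__main__":
    interner = PathInterner()
    fid = interner.intern(("bar", "baz"))
    print(fid.hex(), interner.lookup(fid))
```

The trade-off is that every key becomes the same length regardless of nesting 
depth, at the cost of an extra lookup whenever the original path is needed.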
 
I'll send what I have so far in a separate thread (once one is started).
 
Best regards,
iilyak


On 2019/01/24 12:46:14, Michael Fair <mich...@daclubhouse.net> wrote: 
> On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <rnew...@apache.org>
> wrote:
> 
> >
> > We’d expand each document into a series of key-value pairs, where the key
> > is the full path into the object and the value is the scalar value. E.g,
> >
> > {"foo": 12, "bar": {"baz": 13}}
> >
> > Would be
> >
> > foo => 12
> > bar.baz => 13
> 
> 
> I realize this quickly belongs in its own thread for later discussion, but
> I wanted to point out/ask that by "interning the path strings" or using
> some kind of deterministic hash algorithm, like SHA256 (or something
> faster), on the "key path", couldn't you turn all variable-length string
> paths into a fixed-size, integer-type field id?
> 
> This eliminates the "length" of the path string concern and keeps every
> document field a straight three entry path:
> docid.revisionid.fieldid => [removed?, value]
> 
> where:
> * docid is the unique document identifier
> * revisionid is obvious
> * fieldid is the id of the path string (if a deterministic hash is used,
> it's computed; if indexed, it's looked up/retrieved)
> 
> This idea assumes that the "path.string" <-> fieldid correlation is also
> managed by interning those strings somewhere.
> 
> By adding the removed bit flag, a document becomes simply the aggregation
> of all the latest revisionids for each distinct fieldid lower than the
> revisionid requested; eliminating all duplicate storage requirements for
> non-changing fields.
> 
> When a document update comes in, it breaks the document down into its
> constituent fields, and only needs to add an entry if the state of a field
> has somehow changed from its previous revision.
> 
> It seems like this whole idea might be optimally and transparently handled
> directly inside FDB if FDB was aware of this revisionid "idea".  I'm of
> course not sure which system is expected to handle the described document
> deconstruction.
> 
> 
> ======
> This "fieldid hash" idea is also related to how the IPLD project creates
> "pointers" to JSON documents inside its distributed p2p system to
> hierarchically link portions of different documents together.
> 
> Since a particular docid.revisionid represents a fixed point/state of a
> document in the database, they use that reference as the "value" of a
> special JSON Object that wants to "include"/"point to" the referenced
> document.
> The special JSON Object they used to create a "document link" looks like
> this: {"/": "documenthashid"}
> 
> The uploading document must explicitly put that reference in its own
> document where it wants the system to link in the referenced document.
> This hijacks this form of a JSON Object for this specific purpose and
> prevents all higher level applications of IPLD from using it for any other
> purpose.
> 
> If desirable, the equivalent idea for CouchDB might be: {"_/":
> "docid.revisionid.fieldid"}
> 
> ======
> 
> I'm not saying any of this is a good idea, simply that (1) the string
> length concerns could be eliminated by using interned strings (which likely
> would also improve performance); and (2) this field level storage in FDB
> could enable a basis for adding "document pointers" which I'm sure many
> people would appreciate.
> 
> 
> Mike
> 
