On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <rnew...@apache.org>
wrote:

>
> We’d expand each document into a series of key-value pairs, where the key
> is the full path into the object and the value is the scalar value. E.g.,
>
> {"foo": 12, "bar": {"baz": 13}}
>
> Would be
>
> foo => 12
> bar.baz => 13


I realize this probably belongs in its own thread for later discussion, but
I wanted to ask: by interning the path strings, or by applying a
deterministic hash such as SHA-256 (or something faster) to the key path,
couldn't you turn every variable-length path string into a fixed-size
integer field id?

This eliminates the concern about the length of the path string and gives
every document field a plain three-part key:
docid.revisionid.fieldid => [removed?, value]

where:
* docid is the unique document identifier
* revisionid is obvious
* fieldid is the id of the path string (if a deterministic hash is used,
it's computed; if indexed, it's looked up/retrieved)

This idea assumes that the "path.string" <-> fieldid correlation is also
managed by interning those strings somewhere.
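
A rough sketch of both ways to get a fieldid, only to make the idea
concrete. The names hashed_field_id and PathInterner are made up for this
example, and the 8-byte truncation is an arbitrary choice of mine, not
anything CouchDB or FDB does. Note that a truncated hash can collide, while
an interning table cannot but has to be stored and looked up.

```python
import hashlib


def hashed_field_id(path: str, size: int = 8) -> int:
    """Computed id: the first `size` bytes of SHA-256(path) as an integer.

    Assumption: 8 bytes is "wide enough"; collisions are possible in principle.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:size], "big")


class PathInterner:
    """Looked-up id: sequential integers, with the mapping kept both ways."""

    def __init__(self):
        self._id_by_path = {}   # "bar.baz" -> 1
        self._path_by_id = []   # 1 -> "bar.baz"

    def intern(self, path: str) -> int:
        """Return the existing id for `path`, or assign the next free one."""
        field_id = self._id_by_path.get(path)
        if field_id is None:
            field_id = len(self._path_by_id)
            self._id_by_path[path] = field_id
            self._path_by_id.append(path)
        return field_id

    def path(self, field_id: int) -> str:
        """Reverse lookup: fieldid back to the original path string."""
        return self._path_by_id[field_id]
```

Either way the key stored per field has a fixed width no matter how deeply
nested the original path was.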

With the removed flag added, a document becomes simply the aggregation,
for each distinct fieldid, of the latest entry whose revisionid is lower
than the revisionid requested. That eliminates all duplicate storage for
fields that do not change.

When a document update comes in, it breaks the document down into its
constituent fields, and only needs to add an entry if the state of a field
has somehow changed from its previous revision.
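
To make the update and read paths concrete, here is a minimal in-memory
sketch. Everything in it is an assumption for illustration: FieldStore,
flatten, read and write are hypothetical names, revision ids are plain
increasing integers (real CouchDB revisions are not), the path string
itself stands in for the fieldid, and a read at revision N includes the
entries written at N. A dict stands in for the FDB keyspace.

```python
_MISSING = object()  # sentinel so a stored None differs from "no such field"


def flatten(doc, prefix=""):
    """Expand nested dicts into {"bar.baz": scalar} pairs."""
    out = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out


class FieldStore:
    def __init__(self):
        # (docid, revisionid, fieldid) -> (removed, value)
        self.entries = {}

    def read(self, docid, rev):
        """Rebuild the flattened document as of `rev` (inclusive)."""
        latest = {}
        # Sorted keys visit revisions in increasing order, so a later
        # entry for the same field overwrites an earlier one.
        for (d, r, field), state in sorted(self.entries.items()):
            if d == docid and r <= rev:
                latest[field] = state
        return {f: value for f, (removed, value) in latest.items() if not removed}

    def write(self, docid, rev, doc):
        """Store only fields whose state changed; return entries written."""
        old = self.read(docid, rev - 1)
        new = flatten(doc)
        written = 0
        for field, value in new.items():
            if old.get(field, _MISSING) != value:
                self.entries[(docid, rev, field)] = (False, value)
                written += 1
        for field in old.keys() - new.keys():
            # Field disappeared: record a tombstone via the removed flag.
            self.entries[(docid, rev, field)] = (True, None)
            written += 1
        return written
```

Writing {"foo": 12, "bar": {"baz": 13}} as revision 1 and then changing
only bar.baz in revision 2 adds a single new entry; foo is stored once and
shared by both revisions.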

It seems like this whole idea might be handled optimally and transparently
directly inside FDB if FDB were aware of this revisionid "idea". I'm of
course not sure which system is expected to handle the document
deconstruction described above.


======
This "fieldid hash" idea is also related to how the IPLD project creates
"pointers" to JSON documents inside its distributed p2p system to
hierarchically link portions of different documents together.

Since a particular docid.revisionid represents a fixed point/state of a
document in the database, they use that reference as the "value" of a
special JSON Object that wants to "include"/"point to" the referenced
document.
The special JSON Object they use to create a "document link" looks like
this: {"/": "documenthashid"}

The uploading document must explicitly put that reference in its own
document where it wants the system to link in the referenced document.
This hijacks this form of a JSON Object for this specific purpose and
prevents all higher level applications of IPLD from using it for any other
purpose.

If desirable, the equivalent idea for CouchDB might be: {"_/":
"docid.revisionid.fieldid"}

======

I'm not saying any of this is a good idea, simply that (1) the string
length concerns could be eliminated by using interned strings (which likely
would also improve performance); and (2) this field level storage in FDB
could enable a basis for adding "document pointers" which I'm sure many
people would appreciate.
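
For point (2), resolving the {"_/": "docid.revisionid.fieldid"} form
suggested above could look roughly like the sketch below. It is purely
hypothetical: resolve_links and the lookup callback are names invented
here, the reference is assumed to split on its first two dots (so a docid
or revisionid containing a dot would break it), and nothing guards against
cycles.

```python
LINK_KEY = "_/"  # hypothetical marker from the proposal, not an existing CouchDB feature


def resolve_links(value, lookup):
    """Replace every {"_/": "docid.rev.fieldid"} object with lookup(docid, rev, fieldid).

    `lookup` is supplied by the caller and stands in for the storage layer.
    """
    if isinstance(value, dict):
        if set(value) == {LINK_KEY}:
            # Split on the first two dots only; the fieldid may itself be a dotted path.
            docid, rev, field_id = value[LINK_KEY].split(".", 2)
            return lookup(docid, rev, field_id)
        return {key: resolve_links(item, lookup) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_links(item, lookup) for item in value]
    return value
```

A dict keyed by (docid, revisionid, fieldid) is enough to try it out:
resolve_links(doc, lambda d, r, f: table[(d, r, f)]).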


Mike
