Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

Michael Fair Wed, 30 Jan 2019 11:09:03 -0800

I know the claim was to avoid "revisions" and "conflicts" discussion in
this thread, but isn't that unavoidable?


In scheme #1 you have multiple keys with the same DOCID/PART_IDX but
different data.
In schemes #2 / #3 you have multiple copies of the JSON_PATH but different
values.

The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
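
A minimal sketch of that idea for scheme #1, assuming keys are plain tuples
and a Python dict stands in for the ordered key space. The names `part_key`,
`put_revision` and `get_revision`, the "docs" namespace and the revision ids
are made up for illustration and are not FoundationDB or CouchDB APIs:

```python
# Hypothetical sketch: a dict stands in for the key-value store and
# tuples stand in for keys. Nothing here is a real FoundationDB call.

def part_key(doc_id, rev_id, part_idx):
    """Scheme #1 key with the revision folded into DOC_KEY."""
    return ("docs", doc_id, rev_id, part_idx)

def put_revision(store, doc_id, rev_id, parts):
    """Store every binary part of one revision under its own key."""
    for idx, blob in enumerate(parts):
        store[part_key(doc_id, rev_id, idx)] = blob

def get_revision(store, doc_id, rev_id):
    """Collect the parts of one revision in PART_IDX order."""
    prefix = ("docs", doc_id, rev_id)
    keys = sorted(k for k in store if k[:3] == prefix)
    return b"".join(store[k] for k in keys)

store = {}
put_revision(store, "doc1", "1-abc", [b'{"a":', b'1}'])
put_revision(store, "doc1", "1-def", [b'{"a":', b'2}'])  # conflicting revision
```

With the revision in the key, the two conflicting bodies above occupy four
distinct keys instead of overwriting each other at DOCID/PART_IDX.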

Mike

On Wed, Jan 30, 2019 at 9:53 AM Ilya Khlopotov <iil...@apache.org> wrote:

> The FoundationDB Record Layer uses a global schema for JSON documents. It
> also has a nice way of creating indexes and supports schema evolution.
> However, this support comes at the cost of extra lookups in a different
> subspace. With a local mapping table we are almost certain (except for a
> corner case) that the schema and JSON fields would be colocated on a single
> node, thanks to their common prefix.
>
> Best regards,
> iilyak
> On 2019/01/30 17:05:01, Jan Lehnardt <j...@apache.org> wrote:
> > Ah sure, if we store the *cough* schema per doc, then it's not that
> easy. An iteration of this proposal could store paths globally with ids
> that the k/v store then uses for keys, which would enable what I described,
> but happy to ignore this for the time being. :)
> >
> > Cheers
> > Jan
> > —
> >
> > > On 30. Jan 2019, at 17:58, Adam Kocoloski <kocol...@apache.org> wrote:
> > >
> > > Jan, I don’t think it does have that "fun property #2", as the mapping
> is created separately for each document. In this proposal the field name
> “foo” could map to 2 in one document and 42 in another.
> > >
> > > Thanks for the proposal Ilya. Personally I wonder if the 10KB limit on
> field paths is anything more than a theoretical concern. It’s hard for me
> to imagine a useful schema that would get anywhere near that deep, but
> maybe I’m insufficiently creative :) There’s certainly a storage overhead
> from repeating the upper portion of a path over and over again, but that’s
> also something the storage engine can optimize away through prefix elision.
> The current production storage engine in FoundationDB does not do this
> elision, but the new one in development does.
> > >
> > > The value size limit is probably not so theoretical. I think as a
> project we could choose to impose a 100KB size limit on scalar values - a
> user who had a string longer than 100KB could chunk it up into an array of
> strings pretty easily to work around that limit. But let’s say we don’t
> want to impose that limit. In your design, how do I distinguish {PART_IDX}
> from the elements of the {JSON_PATH}? I was kind of expecting to see some
> magic value indicating that the subsequent set of keys with the same prefix
> are all elements of a “multi-part object”:
> > >
> > > {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}  = kMULTIPART
> > > {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}  = “First 100 KB …"
> > > ...
> > >
> > > You might have figured out something more efficient that saves a KV
> here but I can’t quite grok it.
> > >
> > > Cheers, Adam
> > >
> > >
> > >> On Jan 30, 2019, at 8:24 AM, Jan Lehnardt <j...@apache.org> wrote:
> > >>
> > >>
> > >>
> > >>> On 30. Jan 2019, at 14:22, Jan Lehnardt <j...@apache.org> wrote:
> > >>>
> > >>> Thanks Ilya for getting this started!
> > >>>
> > >>> Two quick notes on this one:
> > >>>
> > >>> 1. Note that JSON does not guarantee object key order and that
> CouchDB has never guaranteed it either. With, say, emit(doc.foo,
> doc.bar), if either emit() parameter were an object, the
> undefined sort order of SpiderMonkey would mix things up. While worth
> bringing up, this is not a BC break.
> > >>>
> > >>> 2. This would have the fun property of being able to rename a key
> inside all docs that have that key.
> > >>
> > >> …in one short operation.
> > >>
> > >> Best
> > >> Jan
> > >> —
> > >>>
> > >>> Best
> > >>> Jan
> > >>> —
> > >>>
> > >>>> On 30. Jan 2019, at 14:05, Ilya Khlopotov <iil...@apache.org>
> wrote:
> > >>>>
> > >>>> # First proposal
> > >>>>
> > >>>> In order to overcome FoundationDB limitations on key size (10 kB)
> and value size (100 kB) we could use the following approach.
> > >>>>
> > >>>> Below, the paths use slashes for illustration purposes only. We
> can use nested subspaces, tuples, directories or something else.
> > >>>>
> > >>>> - Store documents in a subspace or directory  (to keep prefix for a
> key short)
> > >>>> - When we store the document we would enumerate all field names (0
> and 1 are reserved) and store the mapping table under a key which looks like:
> > >>>> ```
> > >>>> {DB_DOCS_NS} / {DOC_KEY} / 0
> > >>>> ```
> > >>>> - Flatten the JSON document (convert it into key value pairs where
> the key is `JSON_PATH` and value is `SCALAR_VALUE`)
> > >>>> - Replace elements of JSON_PATH with integers from mapping table we
> constructed earlier
> > >>>> - When we have an array, use `1 / {array_idx}`
> > >>>> - Store scalar values in the keys which look like the following (we
> use `JSON_PATH` with integers).
> > >>>> ```
> > >>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}
> > >>>> ```
> > >>>> - If the scalar value exceeds 100kB we would split it and store
> every part under key constructed as:
> > >>>> ```
> > >>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}
> > >>>> ```
> > >>>>
> > >>>> Since all parts of the document are stored under a common
> `{DB_DOCS_NS} / {DOC_KEY}` prefix, they will be stored on the same server most of
> the time. The document can be retrieved by using a range query
> (`txn.get_range("{DB_DOCS_NS} / {DOC_KEY} / 0", "{DB_DOCS_NS} / {DOC_KEY} /
> 0xFF")`). We can reconstruct the document since the mapping is returned as
> well.
> > >>>>
> > >>>> The downside of this approach is we wouldn't be able to ensure the
> same order of keys in the JSON object. Currently the `jiffy` JSON encoder
> respects order of keys.
> > >>>> ```
> > >>>> 4> jiffy:encode({[{bbb, 1}, {aaa, 12}]}).
> > >>>> <<"{\"bbb\":1,\"aaa\":12}">>
> > >>>> 5> jiffy:encode({[{aaa, 12}, {bbb, 1}]}).
> > >>>> <<"{\"aaa\":12,\"bbb\":1}">>
> > >>>> ```
> > >>>>
> > >>>> Best regards,
> > >>>> iilyak
> > >>>>
> > >>>>> On 2019/01/30 13:02:57, Ilya Khlopotov <iil...@apache.org> wrote:
> > >>>>> As you might already know, FoundationDB has a number of
> limitations which influence the way we might store JSON documents. The
> limitations are:
> > >>>>>
> > >>>>> | limitation            | recommended value | recommended max | absolute max |
> > >>>>> |-----------------------|------------------:|----------------:|-------------:|
> > >>>>> | transaction duration  |                   |                 |        5 sec |
> > >>>>> | transaction data size |                   |                 |        10 MB |
> > >>>>> | key size              |          32 bytes |            1 kB |        10 kB |
> > >>>>> | value size            |                   |           10 kB |       100 kB |
> > >>>>>
> > >>>>> In order to fit the JSON document into 100kB we would have to
> partition it in some way. There are three ways of partitioning the document:
> > >>>>> 1. store multiple binary blobs (parts) in different keys
> > >>>>> 2. flatten JSON structure and store every path leading to a scalar
> value under its own key
> > >>>>> 3. measure the size of different branches of a tree representing
> the JSON document (while we parse) and use another key for the branch when
> we are about to exceed the limit
> > >>>>>
> > >>>>> - The first approach is the simplest but it wouldn't allow us to
> access parts of the document.
> > >>>>> - The downsides of the second approach are:
> > >>>>>   - flattened JSON structure would have long paths which means
> longer keys
> > >>>>>   - the scalar value cannot be more than 100kB (unless we split it
> as well)
> > >>>>> - The third approach falls short in cases when the structure of the
> document doesn't allow a clean cut of branches:
> > >>>>>   - complex rules to handle all corner cases
> > >>>>>
> > >>>>> The goals of this thread are:
> > >>>>> - to collect ideas on how to encode and store the JSON document
> > >>>>> - to comment on the collected ideas
> > >>>>>
> > >>>>> Non goals:
> > >>>>> - the storage of metadata for the document would be discussed
> elsewhere
> > >>>>> - tombstones
> > >>>>> - edit conflicts
> > >>>>> - revisions
> > >>>>>
> > >>>>> Best regards,
> > >>>>> iilyak
> > >>>>>
> > >>>
> > >>> --
> > >>> Professional Support for Apache CouchDB:
> > >>> https://neighbourhood.ie/couchdb-support/
> > >>>
> > >>
> > >> --
> > >> Professional Support for Apache CouchDB:
> > >> https://neighbourhood.ie/couchdb-support/
> >
> >
>
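
The first proposal quoted above (per-document mapping table, flattened paths,
`1 / {array_idx}` for arrays) can be sketched roughly as follows. This is only
an illustration under stated assumptions: a Python dict stands in for the key
space, keys are tuples, the "docs" namespace and every function name are
hypothetical, the top-level document is assumed to be an object, and empty
objects or arrays are not handled.

```python
# Hypothetical sketch of the flattening scheme. No FoundationDB API is used.
MAPPING, ARRAY = 0, 1  # the two reserved ids from the proposal

def build_mapping(doc, table=None):
    """Enumerate field names, starting at 2 because 0 and 1 are reserved."""
    table = {} if table is None else table
    if isinstance(doc, dict):
        for name, value in doc.items():
            table.setdefault(name, len(table) + 2)
            build_mapping(value, table)
    elif isinstance(doc, list):
        for item in doc:
            build_mapping(item, table)
    return table

def flatten(doc, table, path=()):
    """Yield (JSON_PATH, SCALAR_VALUE) pairs with names replaced by ids."""
    if isinstance(doc, dict):
        for name, value in doc.items():
            yield from flatten(value, table, path + (table[name],))
    elif isinstance(doc, list):
        for idx, item in enumerate(doc):
            yield from flatten(item, table, path + (ARRAY, idx))
    else:
        yield path, doc

def store_doc(store, doc_key, doc):
    """Write the mapping table under id 0, then one key per scalar."""
    table = build_mapping(doc)
    store[("docs", doc_key, MAPPING)] = table
    for path, value in flatten(doc, table):
        store[("docs", doc_key) + path] = value

def _get(container, key):
    if isinstance(container, dict):
        return container.get(key)
    return container[key] if key < len(container) else None

def _set(container, key, value):
    if isinstance(container, dict):
        container[key] = value
    else:
        container.append(value)  # rows arrive sorted, so key == len(container)

def _insert(container, path, value, names):
    key, rest = path[0], path[1:]
    if isinstance(container, dict):
        key = names[key]  # id -> field name
    if not rest:  # reached the scalar
        _set(container, key, value)
        return
    is_array = rest[0] == ARRAY
    child = _get(container, key)
    if child is None:
        child = [] if is_array else {}
        _set(container, key, child)
    _insert(child, rest[1:] if is_array else rest, value, names)

def load_doc(store, doc_key):
    """Emulate the range read and rebuild the document from its rows."""
    prefix = ("docs", doc_key)
    rows = sorted(((k[2:], v) for k, v in store.items() if k[:2] == prefix),
                  key=lambda kv: kv[0])
    names = {i: n for n, i in rows[0][1].items()}  # first row is the mapping
    doc = {}
    for path, value in rows[1:]:
        _insert(doc, path, value, names)
    return doc

doc = {"name": "x", "tags": ["a", "b"],
       "owner": {"name": "y", "ids": [1, {"k": None}]}}
store = {}
store_doc(store, "d", doc)
```

The rebuilt object has the same content as the original, but its keys come
back in id order rather than in the order they were written, which is the
key-order downside mentioned in the proposal.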

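The multi-part layout Adam describes above, with a marker value at
`{JSON_PATH}` followed by `{JSON_PATH} / {PART_IDX}` keys, might look something
like this. It is a sketch only: `K_MULTIPART`, `put_value` and `get_value` are
hypothetical names, a dict stands in for the store, and a real encoding would
have to reserve a marker that no scalar value can encode to.

```python
# Hypothetical sketch of the kMULTIPART marker idea. No FoundationDB API is used.
LIMIT = 100_000  # value size limit discussed in the thread, in bytes

# Stand-in for the "magic value"; a Python sentinel cannot collide with bytes.
K_MULTIPART = object()

def put_value(store, key, value, limit=LIMIT):
    """Store a scalar, splitting it into parts when it exceeds the limit."""
    if len(value) <= limit:
        store[key] = value
        return
    store[key] = K_MULTIPART  # tells readers that the child keys are parts
    for offset in range(0, len(value), limit):
        store[key + (offset // limit,)] = value[offset:offset + limit]

def get_value(store, key):
    """Read a scalar back, joining its parts if the marker is present."""
    value = store[key]
    if value is not K_MULTIPART:
        return value
    part_keys = sorted(k for k in store
                       if len(k) == len(key) + 1 and k[:len(key)] == key)
    return b"".join(store[k] for k in part_keys)

store = {}
put_value(store, ("docs", "d", 2), b"abcdefgh", limit=3)  # tiny limit for demo
put_value(store, ("docs", "d", 3), b"short")
```

The marker costs one extra key-value pair per oversized scalar, which is the
overhead Adam suspects a cleverer encoding might avoid.
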