Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

Adam Kocoloski Wed, 30 Jan 2019 08:59:09 -0800

Jan, I don’t think it does have that "fun property #2", as the mapping is 
created separately for each document. In this proposal the field name “foo” 
could map to 2 in one document and 42 in another.
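
To make the per-document mapping concrete, here is a minimal Python sketch. The function name `build_field_mapping` and the choice to number fields in order of first appearance are my assumptions for illustration; the proposal only says that field names are enumerated and that 0 and 1 are reserved.

```python
from typing import Any, Dict

# Assumption: ids 0 and 1 are reserved (0 for the mapping table key,
# 1 for array indices), so field names are numbered from 2 upward.
FIRST_FIELD_ID = 2


def build_field_mapping(doc: Any) -> Dict[str, int]:
    """Assign an integer to every field name found in one document."""
    mapping: Dict[str, int] = {}

    def visit(node: Any) -> None:
        if isinstance(node, dict):
            for name, child in node.items():
                if name not in mapping:
                    mapping[name] = FIRST_FIELD_ID + len(mapping)
                visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(doc)
    return mapping


# The same field name gets a different id depending on the document.
mapping_a = build_field_mapping({"foo": 1, "bar": 2})
mapping_b = build_field_mapping({"bar": {"baz": True}, "foo": 1})
print(mapping_a["foo"], mapping_b["foo"])
```

Because the table is rebuilt for each document, there is no database-wide id for "foo" that could be repointed to a new name in a single write.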


Thanks for the proposal Ilya. Personally I wonder if the 10KB limit on field 
paths is anything more than a theoretical concern. It’s hard for me to imagine 
a useful schema that would get anywhere near that deep, but maybe I’m 
insufficiently creative :) There’s certainly a storage overhead from repeating 
the upper portion of a path over and over again, but that’s also something the 
storage engine can optimize away through prefix elision. The current production 
storage engine in FoundationDB does not do this elision, but the new one in 
development does.
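
For anyone unfamiliar with prefix elision, the general idea can be sketched in a few lines of Python. This is a generic illustration of the technique over a sorted run of keys, not a description of how either FoundationDB storage engine is implemented; the function names and the example keys are made up.

```python
from typing import List, Tuple


def elide_prefixes(sorted_keys: List[bytes]) -> List[Tuple[int, bytes]]:
    """Store each key as (bytes shared with the previous key, remaining suffix)."""
    encoded = []
    prev = b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def restore_keys(encoded: List[Tuple[int, bytes]]) -> List[bytes]:
    """Rebuild the full keys from the elided representation."""
    keys = []
    prev = b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = [b"docs/doc1/2/3", b"docs/doc1/2/4", b"docs/doc1/5"]
encoded = elide_prefixes(keys)
stored_bytes = sum(len(suffix) for _, suffix in encoded)
print(encoded, stored_bytes, sum(len(k) for k in keys))
```

With keys that share a long upper path, only the first key pays for the full path and the rest store short suffixes.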

The value size limit is probably not so theoretical. I think as a project we 
could choose to impose a 100KB size limit on scalar values - a user who had a 
string longer than 100KB could chunk it up into an array of strings pretty 
easily to work around that limit. But let’s say we don’t want to impose that 
limit. In your design, how do I distinguish {PART_IDX} from the elements of the 
{JSON_PATH}? I was kind of expecting to see some magic value indicating that 
the subsequent set of keys with the same prefix are all elements of a 
“multi-part object”:

{DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}  = kMULTIPART
{DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}  = “First 100 KB …”
...

You might have figured out something more efficient that saves a KV here but I 
can’t quite grok it.
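
Here is roughly what I had in mind, as a Python sketch. Keys are modelled as plain tuples rather than real FoundationDB keys, and the sentinel bytes chosen for `K_MULTIPART`, the 100,000-byte `VALUE_LIMIT`, and the function names are all assumptions for illustration.

```python
from typing import List, Tuple

Key = Tuple
KV = Tuple[Key, bytes]

VALUE_LIMIT = 100_000            # assumed per-value size limit in bytes
K_MULTIPART = b"\x00MULTIPART"   # hypothetical sentinel marking a split value


def encode_scalar(path: Key, value: bytes, limit: int = VALUE_LIMIT) -> List[KV]:
    """Return the key/value pairs needed to store one scalar value."""
    if len(value) <= limit:
        return [(path, value)]
    # The marker tells a reader that the keys below `path` are parts,
    # not further elements of the JSON path.
    pairs: List[KV] = [(path, K_MULTIPART)]
    for part_idx, start in enumerate(range(0, len(value), limit)):
        pairs.append((path + (part_idx,), value[start:start + limit]))
    return pairs


def decode_scalar(pairs: List[KV]) -> bytes:
    """Reassemble a scalar from pairs produced by encode_scalar."""
    _, first_value = pairs[0]
    if first_value != K_MULTIPART:
        return first_value
    return b"".join(part for _, part in pairs[1:])


big = b"x" * 250_000
stored = encode_scalar(("docs", "doc1", 2), big)
print(len(stored), decode_scalar(stored) == big)
```

The cost is one extra KV per oversized value, and a real design would need to make sure an ordinary scalar can never be mistaken for the sentinel.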

Cheers, Adam


> On Jan 30, 2019, at 8:24 AM, Jan Lehnardt <[email protected]> wrote:
> 
> 
> 
>> On 30. Jan 2019, at 14:22, Jan Lehnardt <[email protected]> wrote:
>> 
>> Thanks Ilya for getting this started!
>> 
>> Two quick notes on this one:
>> 
>> 1. note that JSON does not guarantee object key order and that CouchDB has 
>> never guaranteed it either, and with say emit(doc.foo, doc.bar), if either 
>> emit() parameter was an object, the undefined-sort-order of SpiderMonkey 
>> would mix things up. While worth bringing up, this is not a BC break.
>> 
>> 2. This would have the fun property of being able to rename a key inside all 
>> docs that have that key.
> 
> …in one short operation.
> 
> Best
> Jan
> —
>> 
>> Best
>> Jan
>> —
>> 
>>> On 30. Jan 2019, at 14:05, Ilya Khlopotov <[email protected]> wrote:
>>> 
>>> # First proposal
>>> 
>>> In order to overcome FoundationDB limitations on key size (10 kB) and value 
>>> size (100 kB) we could use the following approach.
>>> 
>>> Below, the paths use slashes for illustration purposes only. We can use 
>>> nested subspaces, tuples, directories or something else. 
>>> 
>>> - Store documents in a subspace or directory  (to keep prefix for a key 
>>> short)
>>> - When we store the document we would enumerate all field names (0 and 1 
>>> are reserved) and store the mapping table under a key that looks like:
>>> ```
>>> {DB_DOCS_NS} / {DOC_KEY} / 0
>>> ```
>>> - Flatten the JSON document (convert it into key value pairs where the key 
>>> is `JSON_PATH` and value is `SCALAR_VALUE`)
>>> - Replace elements of JSON_PATH with integers from mapping table we 
>>> constructed earlier
>>> - When we have an array, use `1 / {array_idx}`
>>> - Store scalar values under keys that look like the following (we use 
>>> `JSON_PATH` with integers). 
>>> ```
>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}
>>> ```
>>> - If the scalar value exceeds 100 kB we would split it and store every part 
>>> under a key constructed as:
>>> ```
>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}
>>> ```
>>> 
>>> Since all parts of the documents are stored under a common `{DB_DOCS_NS} / 
>>> {DOC_KEY}` they will be stored on the same server most of the time. The 
>>> document can be retrieved by using a range query 
>>> (`txn.get_range("{DB_DOCS_NS} / {DOC_KEY} / 0", "{DB_DOCS_NS} / {DOC_KEY} / 
>>> 0xFF")`). We can reconstruct the document since the mapping is returned as 
>>> well.
>>> 
>>> The downside of this approach is we wouldn't be able to ensure the same 
>>> order of keys in the JSON object. Currently the `jiffy` JSON encoder 
>>> respects the order of keys.
>>> ```
>>> 4> jiffy:encode({[{bbb, 1}, {aaa, 12}]}).
>>> <<"{\"bbb\":1,\"aaa\":12}">>
>>> 5> jiffy:encode({[{aaa, 12}, {bbb, 1}]}).
>>> <<"{\"aaa\":12,\"bbb\":1}">>
>>> ```
>>> 
>>> Best regards,
>>> iilyak
>>> 
>>> On 2019/01/30 13:02:57, Ilya Khlopotov <[email protected]> wrote: 
>>>> As you might already know, FoundationDB has a number of limitations 
>>>> which influence the way we might store JSON documents. The limitations 
>>>> are:
>>>> 
>>>> | limitation            | recommended value | recommended max | absolute max |
>>>> |-----------------------|------------------:|----------------:|-------------:|
>>>> | transaction duration  |                   |                 |        5 sec |
>>>> | transaction data size |                   |                 |        10 MB |
>>>> | key size              |          32 bytes |            1 kB |        10 kB |
>>>> | value size            |                   |           10 kB |       100 kB |
>>>> 
>>>> In order to fit the JSON document into 100 kB we would have to partition it 
>>>> in some way. There are three ways of partitioning the document:
>>>> 1. store multiple binary blobs (parts) in different keys
>>>> 2. flatten the JSON structure and store every path leading to a scalar value 
>>>> under its own key
>>>> 3. measure the size of different branches of a tree representing the JSON 
>>>> document (while we parse) and use another key for the branch when we are 
>>>> about to exceed the limit
>>>> 
>>>> - The first approach is the simplest but it wouldn't allow us to access 
>>>> parts of the document.
>>>> - The downsides of the second approach are:
>>>>   - the flattened JSON structure would have long paths, which means longer keys
>>>>   - the scalar value cannot be more than 100 kB (unless we split it as well)
>>>> - The third approach falls short in cases when the structure of the document 
>>>> doesn't allow a clean cut-off of branches:
>>>>   - complex rules to handle all corner cases
>>>> 
>>>> The goals of this thread are:
>>>> - to collect ideas on how to encode and store the JSON document
>>>> - to comment on the collected ideas
>>>> 
>>>> Non goals:
>>>> - the storage of metadata for the document would be discussed elsewhere
>>>> - tombstones
>>>> - edit conflicts
>>>> - revisions 
>>>> 
>>>> Best regards,
>>>> iilyak
>>>> 
>> 
>> -- 
>> Professional Support for Apache CouchDB:
>> https://neighbourhood.ie/couchdb-support/
>> 
> 
> -- 
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
