I want to fix previous mistakes. I made two mistakes in the previous calculations:
- I used 1Kb as the base size for calculating the expansion factor (although we don't know the exact size of the original document)
- the expansion factor calculation included the number of revisions (it shouldn't)
I'll focus on the flattened JSON docs model. The following formula was used in the previous calculation:

storage_size_per_document = mapping_table_size * number_of_revisions + depth * number_of_paths * number_of_revisions + number_of_paths * value_size * number_of_revisions

To clarify things a little bit, this time I want to calculate the space requirement for a single revision:

mapping_table_size = number_of_field_names * (field_name_length + 4 (integer size)) = 100 * (20 + 4) = 2400 bytes
storage_size_per_document_per_revision_per_replica = mapping_table_size + depth * number_of_paths + value_size * number_of_paths = 2400 bytes + 10 * 1000 + 1000 * 100 = 112400 bytes ~= 110 Kb

We can definitely reduce the mapping table requirement by adopting rnewson's idea of a schema.
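For anyone who wants to double check the arithmetic, here is a minimal sketch (Python; the variable names are mine and the inputs are just the assumptions from the quoted example below):

# per-revision estimate for the flattened JSON docs model (illustrative only)
FIELD_NAMES = 100        # unique JSON field names per document
FIELD_NAME_LENGTH = 20   # bytes per field name
INT_SIZE = 4             # bytes for the integer id in the mapping table
PATHS = 1000             # unique JSON paths per document
DEPTH = 10               # levels per path
VALUE_SIZE = 100         # bytes per scalar value

mapping_table = FIELD_NAMES * (FIELD_NAME_LENGTH + INT_SIZE)        # 2400 bytes
per_revision = mapping_table + DEPTH * PATHS + VALUE_SIZE * PATHS   # 112400 bytes
print(mapping_table, per_revision, round(per_revision / 1024, 1))   # 2400 112400 109.8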
On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
> Hi Michael,
>
> > For example, here's a crazy thought:
> > Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> >
> > These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
> >
> > So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}
> >
> > When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app local cache misses.
> >
> > A document's value is then a series of {val_id} arrays up to 100k per segment.
> >
> > When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel query each bucket while reconstructing the document.
>
> Interesting idea. Let's try to think it through to see if we can make it viable.
> Let's go through a hypothetical example. Input data for the example:
> - 1M of documents
> - each document is around 10Kb
> - each document consists of 1K of unique JSON paths
> - each document has 100 unique JSON field names
> - every scalar value is 100 bytes
> - 10% of the unique JSON paths for every document are already stored in the database under a different doc or a different revision of the current one
> - we assume 3 independent copies for every key-value pair in FDB
> - our hash key size is 32 bytes
> - let's assume we can determine whether a key is already in storage without doing a query
> - 1% of paths is in cache (an unrealistic value; in real life the percentage is lower)
> - every JSON field name is 20 bytes
> - every JSON path is 10 levels deep
> - document key prefix length is 50
> - every document has 10 revisions
> Let's estimate the storage requirements and the size of data we need to transmit. The calculations are not exact.
> 1. storage_size_per_document (we cannot estimate exact numbers since we don't know how FDB stores it)
>    - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 3 = 1140 Kb (11x)
> 2. number of independent keys to retrieve on document read (non-range queries) per document
>    - 1K - (1K * 1%) = 990
> 3. number of range queries: 0
> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 Kb (10x)
> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html)
>    - sequential: 990 * 2ms = 1980ms
>    - range: 0
> Let's compare these numbers with the initial proposal (flattened JSON docs without a global schema and without a cache)
> 1. storage_size_per_document
>    - mapping table size: 100 * (20 + 4 (integer size)) = 2400 bytes
>    - key size: (10 * (4 + 1 (delimiter))) + 50 = 100 bytes
>    - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
> 2. number of independent keys to retrieve: 0-2 (depending on index structure)
> 3. number of range queries: 1 (1001 keys in the result)
> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)
> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html and estimate range read performance based on numbers from https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>    - range read performance: given that read performance is about 305,000 reads/second and range performance is 3,600,000 keys/second, we estimate range performance to be 11.8x compared to read performance. If read performance is 2ms then range performance is 0.169ms (which is hard to believe).
>    - sequential: 2 * 2 = 4ms
>    - range: 0.169
>
> It looks like we are dealing with a tradeoff:
> - Map every distinct occurrence of a key/value instance through a crypto hash:
>   - 5.39x more disk space efficient
>   - 474x slower
> - flattened JSON model:
>   - 5.39x less efficient in disk space
>   - 474x faster
>
> In any case this unscientific exercise was very helpful, since it uncovered the high cost in terms of disk space. 59.3x of the original disk size is too much IMO.
>
> Are there any ways we can make Michael's model more performant?
>
> Also I don't quite understand a few aspects of the global hash table proposal:
>
> 1.
>   - Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> I think we are talking only about scalar values here? I.e. `"#/foo.bar.baz": 123`
> Since I don't know how we can make it work for all possible JSON paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
> - foo
> - foo.bar
> - foo.bar.baz
>
> 2. how to delete documents
>
> Best regards,
> ILYA
>
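To make the flattened JSON model in the comparison above more concrete, here is roughly the key layout I have in mind (a sketch only; the delimiter, the document key prefix and the helper names are illustrative, not a final design):

# Illustrative flattened-JSON encoding: field names are replaced by 4-byte integer
# ids from a per-document mapping table, and each scalar value is stored under
# doc_prefix + encoded path, matching the key size estimate above
# ((10 * (4 + 1)) + 50 = 100 bytes for a 10-level path).
import struct

def flatten(doc, path=()):
    # yield (path, scalar_value) pairs for every leaf of a JSON object
    for name, value in doc.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (name,))
        else:
            yield path + (name,), value

def encode_key(doc_prefix, path, mapping):
    # doc_prefix, then one 4-byte field id per path level, each preceded by a 1-byte delimiter
    ids = [struct.pack(">I", mapping.setdefault(name, len(mapping))) for name in path]
    return doc_prefix + b"\x00" + b"\x00".join(ids)

mapping = {}  # field name -> integer id (the per-document mapping table)
doc = {"foo": {"bar": {"size": 12, "baz": 123}}}
for path, value in flatten(doc):
    print(encode_key(b"db/docid/revid", path, mapping), value)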
> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net> wrote:
> > On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org> wrote:
> >
> > > Hi Michael,
> > >
> > > > The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> > >
> > > Yes that's definitely one way to address storage of edit conflicts. I think there are other, more compact representations that we can explore if we have this "exploded" data model where each scalar value maps to an individual KV pair.
> >
> > I agree. As I mentioned on the original thread, I see a scheme that handles both conflicts and revisions where you only have to store the most recent change to a field. Like you suggested, multiple revisions can share a key. In my mind's eye this further ties the conflicts/revisions discussion to the working-within-the-limits discussion, because it seems to me they are all intrinsically related as a "feature".
> >
> > Saying 'We'll break documents up into roughly 80k segments', then trying to overlay some kind of field sharing scheme for revisions/conflicts, doesn't seem like it will work.
> >
> > I probably should have left out the trivial fix proposal as I don't think it's a feasible solution to actually use.
> >
> > The comment is more about the fact that I do not see how this thread can avoid covering how to store/retrieve conflicts/revisions.
> >
> > For instance, the 'doc as individual fields' proposal lends itself to value sharing across multiple documents (and I don't just mean revisions of the same doc, I mean the same key/value instance could be shared for every document).
> > However that's not really relevant if we're not considering the amount of shared information across documents in the storage scheme.
> >
> > Simply storing documents in <100k segments (perhaps in some kind of compressed binary representation) to deal with that FDB limit seems fine. The only reason to consider doing something else is because of its impact on indexing, searches, reduce functions, revisions, on-disk size, etc.
> >
> > > > I'm assuming the process will flatten the key paths of the document into an array and then request the value of each key as multiple parallel queries against FDB at once
> > >
> > > Ah, I think this is not one of Ilya's assumptions. He's trying to design a model which allows the retrieval of a document with a single range read, which is a good goal in my opinion.
> >
> > I am not sure I agree.
> >
> > Think of BitTorrent: a single range read should pull back the structure of the document (the pieces to fetch), but not necessarily the whole document.
> >
> > What if you already have a bunch of pieces in common with other documents locally (a repeated header/footer or type, for example), and you only need to get a few pieces of data you don't already have?
> >
> > The real goal for Couch, as I see it, is to treat your document set like the collection of structured information that it is; in some respects like an extension of your application's heap space for structured objects, with efficient queries over that collection to get back subsets of the data.
> >
> > Otherwise it seems more like a slightly upgraded file system plus a fancy grep/find-like feature...
> >
> > The best way I see to unlock more features/power is to move towards a more granular and efficient way to store and retrieve the scalar values...
> >
> > For example, here's a crazy thought:
> > Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> >
> > These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
> >
> > So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}
> >
> > When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app local cache misses.
> >
> > A document's value is then a series of {val_id} arrays up to 100k per segment.
> > When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel query each bucket while reconstructing the document.
> >
> > The values returned from the buckets query are the key/value strings required to reassemble this document.
> >
> > ----------
> > I put this forward primarily to highlight the idea that trying to match the storage representation of documents in a straightforward way to FDB keys, in order to reduce query count, might not be the most performance-oriented approach.
> >
> > I'd much prefer a storage approach that reduced data duplication and enabled fast sub-document queries.
> >
> > This clearly falls in the realm of what people want the 'use case' of Couch to be/become. By giving Couch more access to sub-document queries, I could eventually see queries as complicated as GraphQL submitted to Couch, pulling back ad-hoc aggregated data across multiple documents in a single application-layer request.
> >
> > Hehe - one way to look at the database of Couch documents is that they are all conflict revisions of the single root empty document. What I mean by this is: consider thinking of the entire document store as one giant DAG of key/value pairs, where even separate documents are still typically related to each other. For most applications there is a tremendous amount of data redundancy between docs, and especially between revisions of those docs...
> >
> > And all this is a long way of saying "I think there could be a lot of value in assuming documents are 'assembled' from multiple queries to FDB, with local caching, instead of simply retrieved"
> >
> > Thanks, I hope I'm not the only outlier here thinking this way!?
> >
> > Mike :-)
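P.S. To make sure I read Michael's proposal correctly, here is a rough sketch of how I picture the {bucket_id}/{cryptohash} -> {val_id} model (Python; the bucket assignment, the in-memory stand-in for FDB and all names are my assumptions, not part of the proposal):

# Rough sketch of the crypto-hash / bucket model as I understand it.
import hashlib, json

def val_hash(path, value):
    # crypto hash of a single key/value instance (scalar values only)
    return hashlib.sha256(json.dumps([path, value]).encode()).digest()

def bucket_for(h):
    # placeholder: the proposal assigns buckets statistically by co-occurrence,
    # which is not modelled here
    return h[0] % 16

store = {}  # {(bucket_id, cryptohash): (path, value)} stands in for FDB

def write_doc(doc_id, pairs, doc_index):
    # store each key/value instance once; the document itself is a list of val_ids
    val_ids = []
    for path, value in pairs:
        h = val_hash(path, value)
        val_id = (bucket_for(h), h)
        store.setdefault(val_id, (path, value))  # shared across documents/revisions
        val_ids.append(val_id)
    doc_index[doc_id] = val_ids                  # series of val_ids, up to 100k per segment

def read_doc(doc_id, doc_index):
    # reassemble the document by querying the buckets for each val_id
    return [store[val_id] for val_id in doc_index[doc_id]]

docs = {}
write_doc("doc-1", [("#/foo.bar.baz", 123), ("#/foo.bar.size", 12)], docs)
print(read_doc("doc-1", docs))

With a model like this, a read is one lookup for the val_id list plus roughly one read per val_id that is not in a local cache, which is where the ~990 independent reads in the estimate above come from.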