Hi, thanks for the feedback. Regarding the potential performance / indexing issues, I agree that having more rows in the table will certainly affect performance. On the other hand, I think that storing and querying a large number of rows is exactly what database engines should be optimized for (or maybe that's just wishful thinking). Anyway, I guess it's hard to tell anything until we have a PoC. Starting this thread I was wondering whether there's some obvious reason to stick with the current schema, and since all of the replies started with "I like the idea, but…" I think it's worth proceeding and at least trying.
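For illustration, here is a minimal sketch of the normalized layout being discussed, using SQLite; table, column, and value names are hypothetical, not the actual RDBDocumentStore schema:

```python
import sqlite3

# Hypothetical normalized layout: one row per (id, property, revision)
# instead of one blob per document. All names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node_props (
        id    TEXT NOT NULL,   -- node id, e.g. '2:/content/foo'
        prop  TEXT NOT NULL,   -- property name, e.g. 'jcr:primaryType'
        rev   TEXT NOT NULL,   -- revision, e.g. 'r1-0-1'
        value TEXT,
        PRIMARY KEY (id, prop, rev)
    )
""")

# Each property revision becomes its own thin row.
rows = [
    ("2:/content/foo", "jcr:primaryType", "r1-0-1", '"nt:unstructured"'),
    ("2:/content/foo", "title",           "r1-0-1", '"Hello"'),
    ("2:/content/foo", "title",           "r2-0-1", '"Hello, world"'),
]
conn.executemany("INSERT INTO node_props VALUES (?, ?, ?, ?)", rows)

# Reading a node is a single range scan on the primary key...
props = conn.execute(
    "SELECT prop, rev, value FROM node_props WHERE id = ?",
    ("2:/content/foo",)).fetchall()

# ...but it touches m rows (one per property revision), not one.
print(len(props))  # 3
```

This also makes the index-size concern visible: the (id, prop, rev) key is stored once per property revision, rather than once per document.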
I spent some time researching the JSON support in DBEs. I think such support was primarily added to allow querying JSON documents loaded into the database. E.g. Oracle / MS SQL / DB2 don't even allow modifying values stored in JSON fields. What we need, on the other hand, is support for complex updates, so we don't have to load a full document in order to apply UpdateOps on it. In the current implementation this can be done by appending the serialised diffs. If we switch to using a single JSON document per node, we won't be able to update it in a single call. Instead we'd gain the ability to query the documents, which we don't really need. Because of that I'm a bit skeptical about using this feature.

Best regards,
Tomek

-- 
Tomek Rękawek | Adobe Research | www.adobe.com
[email protected]

> On 17 Aug 2016, at 06:09, Chetan Mehrotra <[email protected]> wrote:
> 
> Hi Tomek,
> 
> I like the idea of revisiting our current schema based on usage so
> far. However, a couple of points around potential issues with such a
> normalized approach:
> 
> - This approach would lead to a thin and loooong table. As noted in
> [1], in a small repo with ~14 M nodes we have ~26 M properties. With
> multiple revisions (GC takes some time) this can go higher. This would
> then increase the memory requirement for the id index. Memory consumption
> increases further with an id+key+revision index. For any db to perform
> optimally the index should fit in RAM. So such a design would
> possibly reduce the max size of repository which can be supported
> (compared to the older one) for a given amount of memory.
> 
> - The read for a specific id can be done in 1 remote call. But that
> would involve a select across multiple rows, which might increase the
> time taken as it would involve 'm' index lookups and then 'm' reads of
> row data for any node having 'n' properties (m > n, assuming multiple
> revisions per property are present).
> 
> Maybe we should explore the json support being introduced in multiple
> dbs:
> DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySql [6].
> The problem here is that we would need DB-specific implementations,
> which also increases the testing effort!
> 
>> we can better use the database features, as now the DBE is aware about the
>> document internal structure (it's not a blob anymore). Eg. we can fetch only
>> a few properties.
> 
> In most cases the kind of properties stored in the blob part of a db row
> are always read as a whole.
> 
> Chetan Mehrotra
> [1] https://issues.apache.org/jira/browse/OAK-4471
> [2] http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/
> [3] https://msdn.microsoft.com/en-in/library/dn921897.aspx
> [4] https://docs.oracle.com/database/121/ADXDB/json.htm
> [5] https://www.postgresql.org/docs/9.3/static/functions-json.html
> [6] https://dev.mysql.com/doc/refman/5.7/en/json.html
> 
> 
> On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth <[email protected]> wrote:
>> Hi Tomek,
>> 
>> I like the idea (and agree with Vikas' comments / cautions as well).
>> 
>> You are hinting at expected performance differences (maybe faster or slower
>> than the current approach). That would probably be worthwhile to investigate
>> in order to assess your idea.
>> 
>> One more (hypothetical at this point) advantage of your approach: we could
>> utilise DB-native indexes as a replacement for property indexes.
>> 
>> Cheers
>> Michael
>> 
>> 
>> 
>> On 16/08/16 07:42, "Tomek Rekawek" <[email protected]> wrote:
>> 
>>> Hi Vikas,
>>> 
>>> thanks for the reply.
>>> 
>>>> On 16 Aug 2016, at 14:38, Vikas Saurabh <[email protected]> wrote:
>>> 
>>>> * It'd incur a very heavy migration impact on upgrade for RDB setups -
>>>> that, most probably, would translate to us having to support both
>>>> schemas. I don't feel that it'd be easy to flip the switch for existing
>>>> setups.
>>> 
>>> That's true.
>>> I think we should take a similar approach here as with the
>>> segment / segment-tar implementations (and we can use oak-upgrade to
>>> convert between them). At least for now.
>>> 
>>>> * DocumentNodeStore implementation very freely touches prop:rev=value
>>>> for a given id… […] I think this would get
>>>> expensive for index (_id+propName+rev) maintenance.
>>> 
>>> Indeed, probably we'll have to analyse the indexing capabilities offered by
>>> different database engines more closely, choosing the one that offers good
>>> write speed.
>>> 
>>> Best regards,
>>> Tomek
>>> 
>>> -- 
>>> Tomek Rękawek | Adobe Research | www.adobe.com
>>> [email protected]
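To make the update trade-off from the top of this thread concrete, here is a rough SQLite sketch (hypothetical table and column names, not the real Oak schema) contrasting the append-style diff update the current implementation can do in one call with the read-modify-write cycle a single JSON document per node would force when the engine cannot modify JSON values in place:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# 'data' holds the serialized document, 'diffs' accumulates appended updates.
conn.execute(
    "CREATE TABLE nodes (id TEXT PRIMARY KEY, data TEXT, diffs TEXT DEFAULT '')")
conn.execute("INSERT INTO nodes VALUES (?, ?, '')",
             ("0:/", json.dumps({"title": "Hello"})))

# Append-style update (current approach): one remote call, no prior read.
# The serialized diff is simply concatenated onto the diffs column.
diff = json.dumps({"title": "Hello, world"}) + "\n"
conn.execute("UPDATE nodes SET diffs = diffs || ? WHERE id = ?", (diff, "0:/"))

# Single-JSON-document approach, when the engine cannot patch JSON in
# place: read the whole document, modify it client-side, write it back
# (two remote calls, plus a concurrent-update problem to solve).
(doc_text,) = conn.execute(
    "SELECT data FROM nodes WHERE id = ?", ("0:/",)).fetchone()
doc = json.loads(doc_text)
doc["title"] = "Hello, world"
conn.execute("UPDATE nodes SET data = ? WHERE id = ?",
             (json.dumps(doc), "0:/"))

(data, diffs) = conn.execute(
    "SELECT data, diffs FROM nodes WHERE id = ?", ("0:/",)).fetchone()
print(json.loads(data)["title"])  # Hello, world
print(diffs.count("\n"))          # 1 appended diff
```

The sketch also shows why in-database JSON querying alone doesn't help here: the missing piece is a server-side partial update, not a richer read path.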