Hi, thanks for the feedback. Regarding the potential performance / indexing issues, I agree that having more rows in the table will certainly affect performance. On the other hand, I think that storing and querying a large number of rows is exactly what database engines should be optimized for (or maybe that's just wishful thinking). Anyway, I guess it's hard to tell anything until we have a PoC. Starting this thread I was wondering whether there's some obvious reason to stick with the current schema, and since all of the replies started with "I like the idea, but…" I think it's worth proceeding and at least trying.
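For illustration, here is a minimal sketch of the normalized layout being discussed, using SQLite; table, column, and value names are hypothetical, not the actual RDBDocumentStore schema:

```python
import sqlite3

# Hypothetical normalized layout: one row per (id, property, revision)
# instead of one blob per document. All names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node_props (
        id    TEXT NOT NULL,   -- node id, e.g. '2:/content/foo'
        prop  TEXT NOT NULL,   -- property name, e.g. 'jcr:primaryType'
        rev   TEXT NOT NULL,   -- revision, e.g. 'r1-0-1'
        value TEXT,
        PRIMARY KEY (id, prop, rev)
    )
""")

# Each property revision becomes its own thin row.
rows = [
    ("2:/content/foo", "jcr:primaryType", "r1-0-1", '"nt:unstructured"'),
    ("2:/content/foo", "title",           "r1-0-1", '"Hello"'),
    ("2:/content/foo", "title",           "r2-0-1", '"Hello, world"'),
]
conn.executemany("INSERT INTO node_props VALUES (?, ?, ?, ?)", rows)

# Reading a node is a single range scan on the primary key...
props = conn.execute(
    "SELECT prop, rev, value FROM node_props WHERE id = ?",
    ("2:/content/foo",)).fetchall()

# ...but it touches m rows (one per property revision), not one.
print(len(props))  # 3
```

This also makes the index-size concern visible: the (id, prop, rev) key is stored once per property revision, rather than once per document.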
I spent some time researching the JSON support in DBEs. I think such support was primarily added to allow querying JSON documents loaded into the database. E.g. Oracle / MS SQL / DB2 don't even allow modifying values stored in JSON fields. What we need, on the other hand, is support for complex updates, so we don't have to load a full document in order to apply UpdateOps on it. In the current implementation this can be done by appending the serialised diffs. If we switch to using a single JSON document per node, we won't be able to update it in a single call. Instead we'd gain the ability to query the documents, which we don't really need. Because of that I'm a bit skeptical about using this feature.

Best regards,
Tomek

-- 
Tomek Rękawek | Adobe Research | www.adobe.com
[email protected]

> On 17 Aug 2016, at 06:09, Chetan Mehrotra <[email protected]> wrote:
> 
> Hi Tomek,
> 
> I like the idea of revisiting our current schema based on usage so
> far. However, a couple of points around potential issues with such a
> normalized approach:
> 
> - This approach would lead to a thin and loooong table. As noted in
> [1], in a small repo with ~14 M nodes we have ~26 M properties. With
> multiple revisions (GC takes some time) this can go higher. This would
> then increase the memory requirement for the id index. Memory consumption
> increases further with an id+key+revision index. For any db to perform
> optimally the index should fit in RAM. So such a design would
> possibly reduce the max size of repository which can be supported
> (compared to the older one) for a given amount of memory.
> 
> - The read for a specific id can be done in 1 remote call. But that
> would involve a select across multiple rows, which might increase the
> time taken as it would involve 'm' index lookups and then 'm' reads of
> row data for any node having 'n' properties (m > n, assuming multiple
> revisions per property are present).
> 
> Maybe we should explore the json support being introduced in multiple
> dbs:
> DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySql [6].
> The problem here is that we would need DB-specific implementations,
> which also increases the testing effort!
> 
>> we can better use the database features, as now the DBE is aware about the
>> document internal structure (it's not a blob anymore). Eg. we can fetch only
>> a few properties.
> 
> In most cases the kind of properties stored in the blob part of a db row
> are always read as a whole.
> 
> Chetan Mehrotra
> [1] https://issues.apache.org/jira/browse/OAK-4471
> [2] http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/
> [3] https://msdn.microsoft.com/en-in/library/dn921897.aspx
> [4] https://docs.oracle.com/database/121/ADXDB/json.htm
> [5] https://www.postgresql.org/docs/9.3/static/functions-json.html
> [6] https://dev.mysql.com/doc/refman/5.7/en/json.html
> 
> 
> On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth <[email protected]> wrote:
>> Hi Tomek,
>> 
>> I like the idea (and agree with Vikas' comments / cautions as well).
>> 
>> You are hinting at expected performance differences (maybe faster or slower
>> than the current approach). That would probably be worthwhile to investigate
>> in order to assess your idea.
>> 
>> One more (hypothetical at this point) advantage of your approach: we could
>> utilise DB-native indexes as a replacement for property indexes.
>> 
>> Cheers
>> Michael
>> 
>> 
>> 
>> On 16/08/16 07:42, "Tomek Rekawek" <[email protected]> wrote:
>> 
>>> Hi Vikas,
>>> 
>>> thanks for the reply.
>>> 
>>>> On 16 Aug 2016, at 14:38, Vikas Saurabh <[email protected]> wrote:
>>> 
>>>> * It'd incur a very heavy migration impact on upgrade for RDB setups -
>>>> that, most probably, would translate to us having to support both
>>>> schemas. I don't feel that it'd be easy to flip the switch for existing
>>>> setups.
>>> 
>>> That's true.
>>> I think we should take a similar approach here as with the
>>> segment / segment-tar implementations (and we can use oak-upgrade to
>>> convert between them). At least for now.
>>> 
>>>> * DocumentNodeStore implementation very freely touches prop:rev=value
>>>> for a given id… […] I think this would get
>>>> expensive for index (_id+propName+rev) maintenance.
>>> 
>>> Indeed, probably we'll have to analyse the indexing capabilities offered by
>>> different database engines more closely, choosing the one that offers good
>>> write speed.
>>> 
>>> Best regards,
>>> Tomek
>>> 
>>> -- 
>>> Tomek Rękawek | Adobe Research | www.adobe.com
>>> [email protected]
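To make the update trade-off from the top of this thread concrete, here is a rough SQLite sketch (hypothetical table and column names, not the real Oak schema) contrasting the append-style diff update the current implementation can do in one call with the read-modify-write cycle a single JSON document per node would force when the engine cannot modify JSON values in place:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# 'data' holds the serialized document, 'diffs' accumulates appended updates.
conn.execute(
    "CREATE TABLE nodes (id TEXT PRIMARY KEY, data TEXT, diffs TEXT DEFAULT '')")
conn.execute("INSERT INTO nodes VALUES (?, ?, '')",
             ("0:/", json.dumps({"title": "Hello"})))

# Append-style update (current approach): one remote call, no prior read.
# The serialized diff is simply concatenated onto the diffs column.
diff = json.dumps({"title": "Hello, world"}) + "\n"
conn.execute("UPDATE nodes SET diffs = diffs || ? WHERE id = ?", (diff, "0:/"))

# Single-JSON-document approach, when the engine cannot patch JSON in
# place: read the whole document, modify it client-side, write it back
# (two remote calls, plus a concurrent-update problem to solve).
(doc_text,) = conn.execute(
    "SELECT data FROM nodes WHERE id = ?", ("0:/",)).fetchone()
doc = json.loads(doc_text)
doc["title"] = "Hello, world"
conn.execute("UPDATE nodes SET data = ? WHERE id = ?",
             (json.dumps(doc), "0:/"))

(data, diffs) = conn.execute(
    "SELECT data, diffs FROM nodes WHERE id = ?", ("0:/",)).fetchone()
print(json.loads(data)["title"])  # Hello, world
print(diffs.count("\n"))          # 1 appended diff
```

The sketch also shows why in-database JSON querying alone doesn't help here: the missing piece is a server-side partial update, not a richer read path.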