Re: normalising the rdb database schema

Mike Nimer Wed, 17 Aug 2016 09:07:57 -0700

This reminds me of a system I worked on years ago that had a similar
problem and solved it with a side table.  It would store the object in the
DB as XML (I told you it was long time ago) and then we had a table on the
side that stored each property in a row, if it's was marked as indexed.
The full XML object gave us quick read access. The side table allowed us to
write complex sql to find objects. Although I will say the sql could get
nasty with this table structure, lot's of self-referencing joins.


This might be crazy, but what if oak did something similar? The current
data could still be stored as a JSON object, like it is now, as the source
of truth.  And then an Async task could be used update the side table with
all properties or only indexed properties of each node.  This could be
disabled for very large implementations or you could flip a flag to store
all revisions or only the latest. This table could even be considered
temporary and rebuilt as needed.

With that said, the new JSON support being added to the different databases
looks very promising and might solve the same problem.
--mike




On Tue, Aug 16, 2016 at 11:09 PM, Chetan Mehrotra <[email protected]
> wrote:

> Hi Tomek,
>
> I like the idea of revisiting our current schema based on usage so
> far. However couple of points around potential issue with such a
> normalized approach
>
> - This approach would lead to a thin and loooong table. As noted in
> [1] in a small repo ~14 M nodes we have ~26 M properties. With
> multiple revisions (GC takes some time) this can go higher. This would
> then increase the memory requirement for id index. Memory consumption
> increases further with id+key+revision index. For any db to perform
> optimally the index should fit in ram. So such such a design would
> possibly reduce the max size of repository which can be supported
> (compared to older one) for given memory
>
> - The read for specific id can be done in 1 remote call. But that
> would involve select across multiple rows which might increase the
> time taken as it would involve 'm' index lookup and then 'm' reads of
> row data for any node having 'n' properties (m > n assuming multiple
> revision for property present)
>
> May be we should explore the json support being introduced in multiple
> dbs.  DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySql [6].
> Problem here is that we would need DB specific implementation and also
> increases the testing effort!
>
> > we can better use the database features, as now the DBE is aware about
> the document internal structure (it’s not a blob anymore). Eg. we can fetch
> only a few properties.
>
> In most cases the kind of properties stored in blob part of db row are
> always read as a whole.
>
> Chetan Mehrotra
> [1] https://issues.apache.org/jira/browse/OAK-4471
> [2] http://www.ibm.com/developerworks/data/library/techarticle/dm-
> 1306nosqlforjson1/
> [3] https://msdn.microsoft.com/en-in/library/dn921897.aspx
> [4] https://docs.oracle.com/database/121/ADXDB/json.htm
> [5] https://www.postgresql.org/docs/9.3/static/functions-json.html
> [6] https://dev.mysql.com/doc/refman/5.7/en/json.html
>
>
> On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth <[email protected]> wrote:
> > Hi Tomek,
> >
> > I like the idea (agree with Vikas’ comments / cautions as well).
> >
> > You are hinting at expected performance differences (maybe faster or
> slower than the current approach). That would probably be worthwhile to
> investigate in order to assess your idea.
> >
> > One more (hypothetical at this point) advantage of your approach: we
> could utilise DB-native indexes as a replacement for property indexes.
> >
> > Cheers
> > Michael
> >
> >
> >
> > On 16/08/16 07:42, "Tomek Rekawek" <[email protected]> wrote:
> >
> >>Hi Vikas,
> >>
> >>thanks for the reply.
> >>
> >>> On 16 Aug 2016, at 14:38, Vikas Saurabh <[email protected]>
> wrote:
> >>
> >>> * It'd incur a very heavy migration impact on upgrade or RDB setups -
> >>> that, most probably, would translate to us having to support both
> >>> schemas. I don't feel that it'd easy to flip the switch for existing
> >>> setups.
> >>
> >>That’s true. I think we should take a similar approach here as with the
> segment / segment-tar implementations (and we can use oak-upgrade to
> convert between them). At least for now.
> >>
> >>> * DocumentNodeStore implementation very freely touches prop:rev=value
> >>> for a given id… […] I think this would get
> >>> expensive for index (_id+propName+rev) maintenance.
> >>
> >>Indeed, probably we’ll have to analyse the indexing capabilities offered
> by different database engines more closely, choosing the one that offers
> good writing speed.
> >>
> >>Best regards,
> >>Tomek
> >>
> >>--
> >>Tomek Rękawek | Adobe Research | www.adobe.com
> >>[email protected]
>

Re: normalising the rdb database schema

Reply via email to