Hi Norberto,

Thank you. That saved me a lot of time, and I learnt something in the
process.

So in your opinion, is there anything that can or should be done in the
DocumentNodeStore from a schema point of view to improve the read or write
performance of Oak on MongoDB, without resorting to sharding or upgrading
to 3.0 and WiredTiger? I am interested in JCR nodes, not blobs.

Best Regards
Ian

On 12 June 2015 at 18:54, Norberto Leite <[email protected]> wrote:

> Hi Ian,
>
> indexes are bound per collection. That means that if you have a large
> collection, its indexes will be correspondingly large. In the case of *_id*,
> which is the primary key of every collection in MongoDB, the index size is
> proportional to the number of documents in the collection.
> Spreading a large data set across different collections makes those indexes
> individually smaller but, in combination, larger (we need to account for the
> overhead of each index entry and the header information that composes each
> index).
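>
> As a concrete way to see this, here is a minimal sketch (assuming the
> legacy 2.x Java driver, a local mongod, and a database named "oak")
> that sums totalIndexSize across collections via collStats:
>
>     import com.mongodb.CommandResult;
>     import com.mongodb.DB;
>     import com.mongodb.MongoClient;
>
>     MongoClient client = new MongoClient("localhost", 27017);
>     DB db = client.getDB("oak");
>     long combined = 0;
>     for (String name : db.getCollectionNames()) {
>         // getStats() runs the collStats command for this collection
>         CommandResult stats = db.getCollection(name).getStats();
>         long idx = ((Number) stats.get("totalIndexSize")).longValue();
>         System.out.println(name + ": totalIndexSize=" + idx + " bytes");
>         combined += idx;
>     }
>     System.out.println("combined: " + combined + " bytes");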
> Also take into account that every time you switch between collections to
> perform different queries (there are no joins in MongoDB) you will need to
> reload into memory the index structures of all the collections affected
> by your query, which comes with some penalties if you do not have enough
> room in RAM for the full amount.
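>
> On MMAPv1 you can pre-warm those structures before a burst of queries
> with the touch command; a sketch, reusing the db handle from above:
>
>     import com.mongodb.BasicDBObject;
>
>     // Load the index structures of "nodes" (but not its data extents)
>     // into RAM ahead of time. MMAPv1-specific command.
>     db.command(new BasicDBObject("touch", "nodes")
>             .append("data", false)
>             .append("index", true));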
> That said, in MongoDB all information is handled using one single big file
> per database (although spread across different extents on disk) with the
> MMAPv1 storage engine (the current default for both 3.0 and 2.6). With
> WiredTiger this is broken down into individual files per collection and per
> index structure.
>
> Bottom line: there would be a marginal benefit to insert rates if you
> broke the JCR nodes collection into different collections, since each
> insert would have smaller index and data structures to traverse and
> update, but there would be a lot more inefficiency on the query side,
> since you would page fault more often while traversing both the indexes
> and the collection data.
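>
> For reference, the child-node lookups are _id range scans of roughly
> this shape (a sketch, reusing the db handle from the first sketch; Oak
> keys documents as "depth:path", and "nodes" is the DocumentNodeStore's
> default collection name):
>
>     import com.mongodb.BasicDBObject;
>     import com.mongodb.DBCursor;
>
>     // Children of /content live under keys "2:/content/<name>"; '0'
>     // is the character right after '/', so it closes the range.
>     DBCursor children = db.getCollection("nodes").find(
>             new BasicDBObject("_id",
>                     new BasicDBObject("$gt", "2:/content/")
>                             .append("$lt", "2:/content0")));
>
> With the data split up, a read spanning several subtrees would have to
> repeat that scan once per collection, paging in a different B-tree
> each time.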
>
> So yes, Chetan is right in stating that the total size occupied by the
> indexes would not be smaller; it would actually increase.
>
> What is important to mention is that sharding takes care of this by
> spreading the load between instances, which reflects immediately in the
> size of the data that each individual shard has to handle (smaller data
> collections = smaller indexes) and allows the workload to be parallelized
> when serving query requests.
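>
> For completeness, enabling that from the Java driver looks roughly like
> this (a sketch; the names "oak"/"oak.nodes" and the choice of _id as
> the shard key are assumptions, and the commands must be sent to a
> mongos router):
>
>     import com.mongodb.BasicDBObject;
>     import com.mongodb.DB;
>     import com.mongodb.MongoClient;
>
>     DB admin = new MongoClient("mongos-host", 27017).getDB("admin");
>     admin.command(new BasicDBObject("enableSharding", "oak"));
>     admin.command(new BasicDBObject("shardCollection", "oak.nodes")
>             .append("key", new BasicDBObject("_id", 1)));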
>
> Another aspect to consider is that fragmentation of the data set will
> affect reads and writes over the long term. I'm going to be delivering a
> talk soon at http://www.connectcon.ch/2015/en.html (if you are interested
> in attending) where I address how to detect and handle these situations
> in JCR implementations.
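>
> In the meantime, one rough way to spot fragmentation on MMAPv1 is to
> compare bytes on disk with live data in collStats (a sketch, reusing
> the db handle from above; the 1.5 threshold is arbitrary):
>
>     import com.mongodb.BasicDBObject;
>     import com.mongodb.CommandResult;
>
>     CommandResult stats = db.getCollection("nodes").getStats();
>     long dataSize = ((Number) stats.get("size")).longValue();
>     long storageSize = ((Number) stats.get("storageSize")).longValue();
>     if ((double) storageSize / dataSize > 1.5) {
>         // compact rewrites the collection and rebuilds its indexes;
>         // on MMAPv1 it blocks the database, so schedule a window.
>         db.command(new BasicDBObject("compact", "nodes"));
>     }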
>
> To complete the description, the concurrency control mechanism (often
> referred to as locking) is more granular in the 3.0 MMAPv1 implementation,
> going from database-level to collection-level locks.
>
>
> N.
>
> On Fri, Jun 12, 2015 at 7:31 PM, Ian Boston <[email protected]> wrote:
>
> > Hi Norberto,
> >
> > Thank you for the feedback on the questions. I see you work as an
> > Evangelist for MongoDB, so you will probably know the answers and can
> > save me time. I agree it's not worth doing anything about concurrency,
> > even if the logs indicate there is contention on locks in 2.6, as the
> > added complexity would make reads worse. And once an upgrade to 3.0
> > has been done, anything collection-based is a waste of time due to the
> > availability of WiredTiger.
> >
> > Could you confirm that separating one large collection into a number of
> > smaller collections will not reduce the size of the indexes that have
> > to be consulted for queries of the form that Chetan shared earlier?
> >
> > I'll try and clarify that question. DocumentNodeStore has one
> > collection, "nodes", containing all Documents. Some queries are only
> > interested in a key space representing a certain part of the "nodes"
> > collection, e.g. n:/largelystatic/**. If those Documents were stored
> > in nodes_x, and count(nodes_x) <= 0.001*count(nodes), would there be
> > any performance advantage, or does MongoDB, under the covers, treat
> > all collections as a single massive collection from an index and
> > query point of view?
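> >
> > To make that concrete, the comparison I have in mind is (a
> > hypothetical sketch; nodes_x does not exist today):
> >
> >     import com.mongodb.BasicDBObject;
> >     import com.mongodb.DBObject;
> >
> >     DBObject range = new BasicDBObject("_id",
> >             new BasicDBObject("$gt", "2:/largelystatic/")
> >                     .append("$lt", "2:/largelystatic0"));
> >     db.getCollection("nodes").find(range);   // one big _id B-tree
> >     db.getCollection("nodes_x").find(range); // ~1000x smaller B-tree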
> >
> > If you have any pointers to how 2.6 scales relative to collection size,
> > number of collections, and index size, that would help me understand
> > more about its behaviour.
> >
> > Best Regards
> > Ian
> >
> >
> >
> >
> > On 12 June 2015 at 17:08, Norberto Leite <[email protected]>
> > wrote:
> >
> > > Hi Ian,
> > >
> > > Your proposal would not be very efficient.
> > > The concurrency control mechanism that 2.6 (the currently supported
> > > version) offers, although not negligible, would not be that
> > > beneficial for the write load. And the reading part, which we can
> > > assume is the bulk of the workload that JCR generates, is not
> > > affected by it.
> > > One also needs to consider that every time you read from the JCR
> > > you would either be issuing a complex M/R operation, which is
> > > designed to span the full set of documents in a given collection,
> > > or you would need to recurse over all affected collections. Not
> > > very efficient.
> > >
> > > The existing mechanism is far simpler and more efficient.
> > > With the upcoming support for WiredTiger, the concurrency control (a
> > > potential issue) becomes totally irrelevant.
> > >
> > > Also don't forget that you cannot predict the number of child nodes
> > > that a given system will use to define its content tree.
> > > If you have a very large number of documents nested at a specific
> > > level, you would need to treat that collection separately (when
> > > needing to scale, shard just that collection and not the others),
> > > bringing in more operational complexity.
> > >
> > > What could be a good discussion point is separating the blobs
> > > collection into its own database, given the flexibility that JCR
> > > offers when treating these two different data types.
> > > Actually, this reminded me that I had a pending JIRA request to
> > > submit on this matter
> > > <https://issues.apache.org/jira/browse/OAK-2984>.
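> > >
> > > For illustration, wiring that up could look something like this (a
> > > sketch against Oak 1.x-era APIs; the database names are
> > > assumptions):
> > >
> > >     import com.mongodb.MongoClient;
> > >     import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
> > >     import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;
> > >     import org.apache.jackrabbit.oak.plugins.document.mongo.MongoBlobStore;
> > >
> > >     MongoClient client = new MongoClient("localhost", 27017);
> > >     DocumentNodeStore store = new DocumentMK.Builder()
> > >             .setMongoDB(client.getDB("oak"))  // node documents
> > >             .setBlobStore(new MongoBlobStore(client.getDB("oak-blobs")))
> > >             .getNodeStore();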
> > >
> > > As Chetan is mentioning, sharding comes into play once we have to
> > > scale the write throughput of the system.
> > >
> > > N.
> > >
> > >
> > > On Fri, Jun 12, 2015 at 4:15 PM, Chetan Mehrotra <
> > > [email protected]>
> > > wrote:
> > >
> > > > On Fri, Jun 12, 2015 at 7:32 PM, Ian Boston <[email protected]> wrote:
> > > > > Initially I was thinking about the locking behaviour but I
> > > > > realise 2.6.* is still locking at the database level, and that
> > > > > only changes to the collection level in 3.0 with MMAPv1, and to
> > > > > row level if you switch to WiredTiger [1].
> > > >
> > > > I initially thought the same, and then we benchmarked the
> > > > throughput by placing the BlobStore in a separate database
> > > > (OAK-1153), but did not observe any significant gains. So that
> > > > approach was not pursued further. If we have some benchmark which
> > > > can demonstrate that write throughput increases if we _shard_ the
> > > > node collection into a separate database on the same server, then
> > > > we can look into it further.
> > > >
> > > > Chetan Mehrotra
> > > >
> > >
> >
>
