Hi,

On 9 July 2015 at 10:33, Thomas Mueller <[email protected]> wrote:
> Hi,
>
> Using MongoDB indexes directly doesn't work because of the MVCC model.
> What we could do is add special collections (basically one collection per
> index). This would require some work, which then would need to be
> repeated for RDBMK. It would be quite some work.
>

OK, understood.

> > I observe that 60% of the size of the nodes collection is attributable
> > to /oak:index
>
> Could you try to find out which index(es) are responsible for that?

Marcel and Chetan have been working on the repository I was observing. I
am sure they can point you to the details offline, if you are not aware of
them already. They were able to remove about 25% of the 60% under
/oak:index, but IIUC most of the remainder is not local customisations,
and perhaps 40% of what remains is not local customisations and must be
synchronous. That indicates a 1:2 ratio between real content nodes and
MongoDB documents before any MongoDB indexes are considered. That ratio
was the motivation for asking the question. Chetan thought I should
discuss it on oak-dev.

Marcel and Chetan have executed 0) and 1) below; they are far more
knowledgeable than I am in this area.

Best Regards
Ian

> There would be multiple ways to reduce the number of nodes:
>
> 0) remove unused indexes
> 1) convert some indexes to Lucene property indexes
> 2) convert to unique index if possible (as this uses less space)
> 3) add a feature to only index a subset of the keys (only index what we
>    need)
> 4) convert the last x levels of the index structure as a property instead
>    of as a node
>
> 3) and 4) would require changes in Oak. For 4), the change should reduce
> the number of nodes, but might cause merge conflicts (not sure).
> With level = 1, it would be:
>
> /content/products/a @color=red
> /content/products/b @color=red
>
> /oak:index/color/red/content
> /oak:index/color/red/content/products @a = true, @b = true
>
> instead of
>
> /oak:index/color/red/content
> /oak:index/color/red/content/products
> /oak:index/color/red/content/products/a @match = true
> /oak:index/color/red/content/products/b @match = true
>
> With level > 1, it would require some escaping magic, but we could save
> some more nodes, and basically it would be:
>
> level = 2:
>
> /oak:index/color/red/content @products_a = true, @products_b = true
>
> level = 3:
>
> /oak:index/color/red @content_products_a = true, @content_products_b = true
>
> Regards,
> Thomas
>
> On 08/07/15 18:18, "Ian Boston" <[email protected]> wrote:
>
> >Hi,
> >I am confused about how /oak:index works and why it is needed in a
> >MongoDB setting which has native database indexes that appear to cover
> >the same functionality. Could the Oak Query engine use DB indexes
> >directly for all indexes that are built into Oak, and Lucene indexes for
> >all custom indexes?
> >
> >I am asking this because in MongoDB I observe that 60% of the size of the
> >nodes collection is attributable to /oak:index, and that the 60% increases
> >every non-sparse MongoDB index by about 3x. An _id + _modified compound
> >index in MongoDB comes out at about 70GB for 100M documents (in part due
> >to the size of _id). Without the duplication in /oak:index it could be
> >closer to 25GB. Disk space is cheap, but MongoDB working set RAM is not
> >cheap, and neither is page fault IO.
> >
> >I fully understand why TarMK needs /oak:index, but I can't understand
> >(conceptually) the need to implement an index inside a database table.
> >It's like trying to implement an inverted index in an RDBMS table, which,
> >as everyone who has ever tried (or used) that approach knows, doesn't
> >scale nearly as far as Lucene bitmaps.
> >
> >Could /oak:index be replaced by something that doesn't generate
> >Documents/db rows as fast as it does?
> >
> >Best Regards
> >Ian
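
To make the node counts in option 4) quoted above easier to compare, here is a rough Python sketch that builds the index entries for a given level. It is only an illustration, not how Oak stores or would store its index: the function name index_nodes and its parameters are made up for this example, the "_" join is a naive stand-in for the "escaping magic" Thomas mentions (it does no real escaping), and the count includes the /oak:index/<name>/<value> node itself.

```python
def index_nodes(paths, index_name, value, level=0):
    """Return {index node path: {property name: True}} for one indexed value.

    level = 0 keeps one index node per content node (marked with @match).
    level = k folds the last k path segments into a property name on the
    ancestor node, as in the examples quoted above.
    """
    root = f"/oak:index/{index_name}/{value}"
    nodes = {root: {}}
    for path in paths:
        segments = [s for s in path.split("/") if s]
        if level > len(segments):
            raise ValueError(f"level {level} is deeper than path {path!r}")
        keep = len(segments) - level
        node_segments, folded = segments[:keep], segments[keep:]
        # Create (or reuse) one index node per remaining path segment.
        current = root
        for segment in node_segments:
            current = f"{current}/{segment}"
            nodes.setdefault(current, {})
        # Hypothetical naming: no escaping of "_" inside segment names.
        prop = "_".join(folded) if folded else "match"
        nodes[current][prop] = True
    return nodes


if __name__ == "__main__":
    content = ["/content/products/a", "/content/products/b"]
    for k in range(4):
        entries = index_nodes(content, "color", "red", level=k)
        print(f"level {k}: {len(entries)} index nodes")
        for node_path, props in sorted(entries.items()):
            print("   ", node_path, sorted(props))
```

For the two example paths this gives 5, 3, 2 and 1 index nodes for levels 0 to 3. The sketch says nothing about the merge conflict question, since every folded entry becomes a property on a shared node.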
