A collection per index (or a separate one for indexes only), especially for the asynchronous ones, will translate into a big benefit if the following occurs:

- when querying on index nodes we don't need to get all related node documents (which is happening)
- the write operations are distinct between indexes and nodes (which I think is also happening)
N.

On Thu, Jul 9, 2015 at 11:33 AM, Thomas Mueller <muel...@adobe.com> wrote:
> Hi,
>
> Using MongoDB indexes directly doesn't work because of the MVCC model.
> What we could do is add special collections (basically one collection per
> index). This would require some work, which then would need to be
> repeated for RDBMK. It would be quite some work.
>
> > I observe that 60% of the size of the nodes collection is attributable
> > to /oak:index
>
> Could you try to find out which index(es) are responsible for that? There
> would be multiple ways to reduce the number of nodes:
>
> 0) remove unused indexes
> 1) convert some indexes to Lucene property indexes
> 2) convert to unique index if possible (as this uses less space)
> 3) add a feature to only index a subset of the keys (only index what we
>    need)
> 4) convert the last x levels of the index structure into a property
>    instead of a node
>
> 3) and 4) would require changes in Oak. For 4), the change should reduce
> the number of nodes, but might cause merge conflicts (not sure).
> With level = 1, it would be:
>
> /content/products/a @color=red
> /content/products/b @color=red
>
> /oak:index/color/red/content
> /oak:index/color/red/content/products @a = true, @b = true
>
> instead of
>
> /oak:index/color/red/content
> /oak:index/color/red/content/products
> /oak:index/color/red/content/products/a @match = true
> /oak:index/color/red/content/products/b @match = true
>
> With level > 1, it would require some escaping magic, but we could save
> some more nodes, and basically it would be:
>
> level = 2:
>
> /oak:index/color/red/content @products_a = true, @products_b = true
>
> level = 3:
>
> /oak:index/color/red @content_products_a = true, @content_products_b = true
>
> Regards,
> Thomas
>
> On 08/07/15 18:18, "Ian Boston" <i...@tfd.co.uk> wrote:
>
> > Hi,
> > I am confused about how /oak:index works and why it is needed in a
> > MongoDB setting, which has native database indexes that appear to cover
> > the same functionality. Could the Oak Query engine use DB indexes
> > directly for all indexes that are built into Oak, and Lucene indexes for
> > all custom indexes?
> >
> > I am asking this because in MongoDB I observe that 60% of the size of the
> > nodes collection is attributable to /oak:index, and that the 60%
> > increases the size of every non-sparse MongoDB index by about 3x. An
> > _id + _modified compound index in MongoDB comes out at about 70GB for
> > 100M documents (in part due to the size of _id). Without the duplication
> > from /oak:index it could be closer to 25GB. Disk space is cheap, but
> > MongoDB working set RAM is not cheap, and neither is page fault IO.
> >
> > I fully understand why TarMK needs /oak:index, but I can't understand
> > (conceptually) the need to implement an index inside a database table.
> > It's like trying to implement an inverted index in an RDBMS table, which,
> > as everyone who has ever tried (or used) that approach knows, doesn't
> > scale nearly as far as Lucene bitmaps.
> >
> > Could /oak:index be replaced by something that doesn't generate
> > Documents/db rows as fast as it does?
> >
> > Best Regards
> > Ian
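
For reference, the level-based layout Thomas describes in option 4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `index_entry` is hypothetical and not part of Oak, level = 0 stands for the current layout with `@match`, and folded path segments are joined with a plain "_" without the escaping that a real implementation would need.

```python
def index_entry(content_path, index_name, value, level):
    """Return (index_node_path, property_name) for one indexed node.

    Hypothetical helper, not an Oak API. level = 0 reproduces the current
    layout (one index node per match, flagged with @match). level = n folds
    the last n path segments into a property name on the ancestor node.
    """
    segments = [s for s in content_path.split("/") if s]
    if level < 0 or level > len(segments):
        raise ValueError("level must be between 0 and the path depth")

    base = ["oak:index", index_name, value]
    if level == 0:
        return "/" + "/".join(base + segments), "match"

    kept = segments[:len(segments) - level]
    folded = segments[-level:]
    # Assumption: naive "_" join. Names that already contain "_" would be
    # ambiguous here, which is the "escaping magic" left out of this sketch.
    return "/" + "/".join(base + kept), "_".join(folded)


if __name__ == "__main__":
    for lvl in range(4):
        node, prop = index_entry("/content/products/a", "color", "red", lvl)
        print(f"level = {lvl}: {node} @{prop} = true")
```

With level = 1, both /content/products/a and /content/products/b map to the same index node, /oak:index/color/red/content/products, so the two per-match child nodes collapse into two properties on one node. That is where the node savings would come from, and presumably also where the possible merge conflicts would arise.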