Re: /oak:index (DocumentNodeStore)

Ian Boston Thu, 09 Jul 2015 03:30:01 -0700

Hi,

On 9 July 2015 at 10:33, Thomas Mueller <[email protected]> wrote:


> Hi,
>
> Using MongoDB indexes directly doesn't work because of the MVCC model.
> What we could do is add special collections (basically one collection per
> index). This would require some work, which would then need to be
> repeated for RDBMK. It would be quite some work.
>

ok, understood.


>
> > I observe that 60% of the size of the nodes collection is attributable
> >to /oak:index
>
> Could you try to find out which index(es) are responsible for that?


Marcel and Chetan have been working on the repository I was observing. I am
sure they can point you to the details offline, if you are not aware of it
already. They were able to remove about 25% of the 60% under /oak:index,
but IIUC most of the remainder is not local customisations, and perhaps
40% of what remains is not local customisations and must be synchronous,
which indicates a 1:2 ratio between real content nodes and MongoDB
documents before any MongoDB indexes are considered. That ratio was the
motivation for asking the question. Chetan thought I should discuss on
oak-dev.

Marcel and Chetan have executed 0) and 1) below, and are far more
knowledgeable than I am in this area.

Best Regards
Ian



> There
> would be multiple ways to reduce the number of nodes:
>
> 0) remove unused indexes
> 1) convert some indexes to Lucene property indexes
> 2) convert to unique index if possible (as this uses less space)
> 3) add a feature to only index a subset of the keys (only index what we
> need)
> 4) convert the last x levels of the index structure as a property instead
> of as a node
>
>
> 3) and 4) would require changes in Oak. For 4), the change should reduce
> the number of nodes, but might cause merge conflicts (not sure). With
> level = 1, it would be:
>
>   /content/products/a @color=red
>   /content/products/b @color=red
>
>   /oak:index/color/red/content
>   /oak:index/color/red/content/products @a = true, @b = true
>
> instead of
>
>   /oak:index/color/red/content
>   /oak:index/color/red/content/products
>   /oak:index/color/red/content/products/a @match = true
>   /oak:index/color/red/content/products/b @match = true
>
> With level > 1, it would require some escaping magic, but we could save
> some more nodes, and basically it would be:
>
> level = 2:
>
>   /oak:index/color/red/content @products_a = true, @products_b = true
>
>
> level = 3:
>
>   /oak:index/color/red @content_products_a = true, @content_products_b = true
>
>
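The mapping described in 4) can be sketched in a few lines of Python. This is only an illustration of the layout in the examples above, not Oak code: the function names index_entry and build_index are hypothetical, and joining the collapsed path segments with "_" is a naive stand-in for the "escaping magic" mentioned above (a node name that itself contains "_" would be ambiguous here).

```python
def index_entry(index_name, value, content_path, level=0):
    """Return (index_node_path, property_name) for one indexed content node.

    level = 0 is the current layout (one index node per content node,
    flagged with @match). level = n folds the last n path segments
    into a property name on an ancestor index node.
    """
    segments = [s for s in content_path.split("/") if s]
    if level < 0 or level > len(segments):
        raise ValueError("level must be between 0 and the path depth")
    base = ["oak:index", index_name, value]
    if level == 0:
        return "/" + "/".join(base + segments), "match"
    kept = segments[:len(segments) - level]
    collapsed = segments[len(segments) - level:]
    # Naive join; a real implementation would need proper escaping.
    return "/" + "/".join(base + kept), "_".join(collapsed)


def build_index(index_name, value, content_paths, level=0):
    """Map every index node path (intermediate nodes included) to its properties."""
    root = "/oak:index/%s/%s" % (index_name, value)
    nodes = {root: set()}
    for path in content_paths:
        node, prop = index_entry(index_name, value, path, level)
        current = root
        for segment in [s for s in node[len(root):].split("/") if s]:
            current = current + "/" + segment
            nodes.setdefault(current, set())
        nodes[node].add(prop)
    return nodes


if __name__ == "__main__":
    paths = ["/content/products/a", "/content/products/b"]
    for level in range(4):
        index = build_index("color", "red", paths, level)
        print("level", level, "->", len(index), "index nodes")
```

Counting the value node /oak:index/color/red itself, the two content nodes from the example need 5 index nodes at level 0, 3 at level 1, 2 at level 2 and 1 at level 3. The possible merge conflicts mentioned above come from several content nodes sharing one index node.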
>
>
> Regards,
> Thomas
>
>
>
>
>
> On 08/07/15 18:18, "Ian Boston" <[email protected]> wrote:
>
> >Hi,
> >I am confused at how /oak:index works and why it is needed in a MongoDB
> >setting which has native database indexes that appear to cover the same
> >functionality. Could the Oak Query engine use DB indexes directly for all
> >indexes that are built into Oak, and Lucene indexes for all custom
> >indexes ?
> >
> >I am asking this because in MongoDB I observe that 60% of the size of the
> >nodes collection is attributable to /oak:index, and that the 60% increases
> >every non-sparse MongoDB index by about 3x. An _id + _modified compound
> >index in MongoDB comes out at about 70GB for 100M documents (in part due
> >to the size of _id). Without the duplication in /oak:index it could be
> >closer to 25GB. Disk space is cheap, but MongoDB working set RAM is not
> >cheap, and neither is page fault IO.
> >
> >I fully understand why TarMK needs /oak:index, but I can't understand
> >(conceptually) the need to implement an index inside a database table.
> >It's like trying to implement an inverted index in an RDBMS table, which,
> >as everyone who has ever tried (or used) that approach knows, doesn't
> >scale nearly as far as Lucene bitmaps.
> >
> >Could /oak:index be replaced by something that doesn't generate
> >Documents/db rows as fast as it does ?
> >
> >Best Regards
> >Ian
>
>
