Hi,

It is (hopefully) not the case that each content update involves a
synchronous property index update. It would be nice to have some statistics
for that. The idea was that only a few properties should be indexed, so
that most changes to the repository don't require index updates. I
understand the number of nodes for property indexes is large, because the
"content mirror strategy" of the property index stores, below each indexed
value, a copy of the path hierarchy of the matching nodes. That means lots
of intermediate nodes. Example:

  /content/a/b/c/x/@color = red
  /content/a/b/c/y/@color = blue

results in

  /oak:index/color/red/a/b/c/x/@match=true
  /oak:index/color/blue/a/b/c/y/@match=true
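
To make the node count concrete, here is a minimal Python sketch of the
mapping in the example above. It is not Oak code: the helper names
mirror_index_path and mirrored_nodes are hypothetical, and it assumes the
leading "/content" segment is dropped, as in the example.

```python
def mirror_index_path(content_path, prop, value, root="/content"):
    """Map a content node path to its mirrored index path (illustrative only)."""
    prefix = root + "/"
    if not content_path.startswith(prefix):
        raise ValueError(f"{content_path!r} is not below {root!r}")
    return f"/oak:index/{prop}/{value}/{content_path[len(prefix):]}"


def mirrored_nodes(content_path, root="/content"):
    """List every node created below the value node, intermediates included."""
    prefix = root + "/"
    if not content_path.startswith(prefix):
        raise ValueError(f"{content_path!r} is not below {root!r}")
    segments = content_path[len(prefix):].split("/")
    # One index node per path segment: a, a/b, a/b/c, a/b/c/x
    return ["/".join(segments[:i]) for i in range(1, len(segments) + 1)]


print(mirror_index_path("/content/a/b/c/x", "color", "red"))
print(len(mirrored_nodes("/content/a/b/c/x")), "nodes below the value node")
```

In this sketch, the number of index nodes per entry grows with the depth
of the indexed node, which is where the intermediate nodes come from.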

It would be nice to reduce the number of nodes in the property index, for
example by flattening the hierarchy:

  /oak:index/color/red/a.b.c.x/@match=true
  /oak:index/color/blue/a.b.c.y/@match=true

This would reduce the number of nodes (instead of 4 nodes "./a/b/c/x"
there would only be 1 node "./a.b.c.x"). That's just an example of what we
could do.
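
A rough Python sketch of the flattening idea, only to illustrate the
naming scheme. The helper flat_index_path is hypothetical, and using "."
as the separator is taken from the example above. Node names that
themselves contain "." would make the flattened name ambiguous, so a real
scheme would presumably need some escaping, which this sketch does not
attempt.

```python
def flat_index_path(content_path, prop, value, root="/content"):
    """Map a content node path to a single flattened index node (illustrative only)."""
    prefix = root + "/"
    if not content_path.startswith(prefix):
        raise ValueError(f"{content_path!r} is not below {root!r}")
    # Join all path segments into one node name: a/b/c/x -> a.b.c.x
    flat_name = content_path[len(prefix):].replace("/", ".")
    return f"/oak:index/{prop}/{value}/{flat_name}"


def nodes_below_value(index_path, prop, value):
    """Count the index nodes created below the value node for one entry."""
    base = f"/oak:index/{prop}/{value}/"
    return len(index_path[len(base):].split("/"))


path = flat_index_path("/content/a/b/c/x", "color", "red")
print(path, nodes_below_value(path, "color", "red"))
```

With this naming, each indexed node costs one index node regardless of
its depth in the content tree.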




>why are we storing the indexes in the repository itself?

With Jackrabbit 2.x, we stored the (Lucene) index somewhere else, which
led to problems: in case of a crash, the index was not in the same state
as the repository.

Reducing the number of "storage backends" simplifies the architecture.
With Oak, we have two backends: datastore, and nodestore. I think that's a
good architecture. Even simpler would be to only use one backend, but I
think using a special mechanism for binaries is fine.

>Or better, sharded indexes with local indexes managed independently (each
>committing to memory not disk with a WAL to deal with failures) so the
>cost of indexing is parallelised and can scale horizontally... which is
>one step beyond the Hybrid Index proposal.

That would be more complex, and I'm not sure how much it would help in
practice.

Regards,
Thomas
