Hi,

It is (hopefully) not the case that each content update contains a sync property update. It would be nice to have some statistics on that. The idea was that only a few properties should be indexed, so that most changes to the repository don't require index updates.

I understand the number of nodes for property indexes is large, because the "content mirror strategy" of the property index creates a copy of the hierarchy for each indexed value. That means lots of intermediate nodes. Example:

    /content/a/b/c/x/@color = red
    /content/a/b/c/y/@color = blue

results in

    /oak:index/color/red/a/b/c/x/@match=true
    /oak:index/color/blue/a/b/c/y/@match=true

It would be nice to reduce the number of nodes in the property index, for example by flattening the hierarchy:

    /oak:index/color/red/a.b.c.x/@match=true
    /oak:index/color/blue/a.b.c.y/@match=true

This would reduce the number of nodes (instead of 4 nodes "./a/b/c/x" there would be only 1 node "./a.b.c.x"). That's just an example of what we could do.

>why are we storing the indexes in the repository itself?

With Jackrabbit 2.x, we stored the (Lucene) index somewhere else, which led to problems: in case of a crash, the index was not in the same state as the repository. Reducing the number of "storage backends" simplifies the architecture. With Oak, we have two backends: datastore and nodestore. I think that's a good architecture. Even simpler would be to use only one backend, but I think using a special mechanism for binaries is fine.

>Or better, sharded indexes with local indexes managed independently (each
>committing to memory not disk with a WAL to deal with failures) so the cost
>of indexing is parallelised and can scale horizontally... which is one step
>beyond the Hybrid Index proposal.

That would be more complex, and I'm not sure how much it would help in reality.

Regards,
Thomas
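
P.S. To make the node-count argument concrete, here is a rough sketch in Python. It is not Oak code: the function names are made up for illustration, and it assumes the simplified layout from the example above (paths taken relative to /content, and only nodes below the value node counted).

```python
# Illustrative only: models the simplified index layout from the example,
# not the actual Oak property index implementation.

def relative_segments(path, root="/content"):
    """Split a content path into its segments below the (assumed) root."""
    return path[len(root):].strip("/").split("/")

def mirror_nodes(entries, root="/content"):
    """Index nodes created when the content hierarchy is mirrored."""
    nodes = set()
    for path, value in entries:
        segments = relative_segments(path, root)
        # Every ancestor of the indexed node becomes an intermediate node.
        for depth in range(1, len(segments) + 1):
            nodes.add(value + "/" + "/".join(segments[:depth]))
    return nodes

def flat_nodes(entries, root="/content"):
    """Index nodes created when the hierarchy is flattened into one name."""
    nodes = set()
    for path, value in entries:
        # A real implementation would have to escape dots in node names.
        nodes.add(value + "/" + ".".join(relative_segments(path, root)))
    return nodes

entries = [("/content/a/b/c/x", "red"), ("/content/a/b/c/y", "blue")]
print(len(mirror_nodes(entries)))  # 8 nodes: 4 per indexed value
print(len(flat_nodes(entries)))    # 2 nodes: 1 per indexed value
```

The saving grows with the depth of the indexed nodes, but shrinks when many indexed nodes share the same value and the same ancestors, because the mirrored intermediate nodes are then shared as well.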
