Re: reindex improvements

Thomas Mueller Tue, 26 Aug 2014 06:41:40 -0700

Hi,

Did we already run into this problem in reality? How much of a pain point
is it? I think creating indexes is a "maintenance job", which doesn't need
to be done very often, comparable to creating a backup. If creating the
index is asynchronous, then it's OK if it's slow. Re-indexing (re-building
an existing index) should only be needed if there is a bug in the indexing
code.

If we really want to support it (not sure if it's worth it), I see two
main options:

* Defining a path filter in the index is an option, but I would probably
call it just "paths" and not "reindexPaths". Such an index would only be
used if the query is restricted to one of the paths.

* We could define indexes in a subtree. We discussed that a while back,
and indeed we already have some code for it. Right now, all indexes are
stored under "/oak:index/...". If you want to index only "/content/", then
the index could be stored under "/content/oak:index" (for example).
However, there are some problems: finding such an index requires that the
given subtree is read when running the query. Also, defining access rights
for those indexes is not trivial. Even though it has some advantages, it
also has disadvantages.
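
To make the first option above more concrete, here is a minimal sketch of
the check "only use the index if the query is restricted to one of the
paths". It is not Oak code: the class PathRestrictedIndex and the methods
canServe and isSameOrDescendant are hypothetical names made up for
illustration, and it assumes plain absolute paths (no glob or regex).

```java
import java.util.List;

// Hypothetical sketch, not part of Oak: an index definition that carries a
// multi-valued "paths" property and is only eligible for queries whose path
// restriction falls inside one of those paths.
public class PathRestrictedIndex {

    private final List<String> paths;

    public PathRestrictedIndex(List<String> paths) {
        this.paths = List.copyOf(paths);
    }

    // Drops a trailing slash so that "/content/" and "/content" are treated alike.
    private static String normalize(String path) {
        if (path.length() > 1 && path.endsWith("/")) {
            return path.substring(0, path.length() - 1);
        }
        return path;
    }

    // True if 'path' is 'ancestor' itself or lies somewhere below it.
    static boolean isSameOrDescendant(String ancestor, String path) {
        String a = normalize(ancestor);
        String p = normalize(path);
        if (a.equals("/")) {
            return p.startsWith("/");
        }
        // Appending "/" avoids matching siblings such as "/contentious".
        return p.equals(a) || p.startsWith(a + "/");
    }

    // A query without a path restriction (null) cannot use this index,
    // because the index does not cover the whole repository.
    public boolean canServe(String queryPathRestriction) {
        if (queryPathRestriction == null) {
            return false;
        }
        for (String indexedPath : paths) {
            if (isSameOrDescendant(indexedPath, queryPathRestriction)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PathRestrictedIndex index = new PathRestrictedIndex(List.of("/content/"));
        System.out.println(index.canServe("/content/site/en"));
        System.out.println(index.canServe("/contentious"));
        System.out.println(index.canServe(null));
    }
}
```

The point of the sketch is the last case: a query with no path restriction
has to fall back to another index (or a traversal), since a path-filtered
index only knows about part of the repository.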

Regards,
Thomas

On 26/08/14 12:04, "Davide Giannella" <[email protected]> wrote:

>Hello team,
>
>when we trigger a reindex by setting `reindex=true` on the index
>definition, the algorithm scans the whole repository and issues a "node
>modified/added" event to the specified index.
>
>While this works with small repositories, it doesn't really scale to
>big ones.
>
>To take an extreme example, say we have a repository of 2 million nodes
>with only 1 node carrying the required property. The reindex will keep
>going until all 2 million nodes have been scanned. And with very active
>repositories where we change a lot of nodes, manually or not, we could
>virtually have an endless reindexing.
>
>Based on my experience with content repositories, clients are normally
>interested in querying only parts of them. For example /content.
>
>I was thinking that it could be a good added value if we could add an
>additional property to the index definition: reindexPaths (multivalue,
>String).
>
>When this property is specified, the reindex will happen only on those
>paths, in the order in which they are specified, and it could potentially
>make the content indexed so far available to the query engine for
>returning partial results as each path is completed.
>
>A single path could be just a plain path or a glob/regex. I'm for using
>a Java regex as it gives the end user a lot of power for fine tuning, but
>on the other hand regex evaluation is pretty slow...
>
>thoughts?
>
>Cheers
>Davide
>
>
>
