Hello Jukka et al.,

I've read the entire thread and reply inline below to Jukka's initial
proposal, as I have some doubts in that area:

On Tue, Sep 18, 2012 at 5:14 PM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
<snip/>
>
> First of all I think there shouldn't be just one single place in the
> repository where all index configuration should go. It would be nice
> if users and applications could define custom indexes on areas they
> have write access to, and having to grant them access to some shared
> location for that might be troublesome.
>
> Instead I'd allow a custom indexes to be defined by adding something
> like an oak:indexed mixin type and an associated oak:indexes child
> node to any node in the repository. Each child node of that
> oak:indexes node would configure an index for the subtree rooted at
> that oak:indexed node. Index configuration would be stored as normal
> content, and the index content in a hidden :index subtree or elsewhere
> depending on the type of the index.

Having the Lucene indexes inside the repository is of course really
nice, as currently (Jackrabbit 2.x), bringing up a new cluster
repository node means you first have to index the entire repository to
create a *local* FS Lucene index (or actually indexes).

That said, I have not yet heard of *any* successful Lucene
implementation that did not have the Lucene indexes near the
computation. Thus having the Lucene indexes in, say, some NoSQL store
or database pretty much means it will never perform, afaiu. Also, I've
talked to Simon Willnauer (Lucene chair) a couple of times about these
kinds of attempts. He says Lucene will *never* perform if the data
(indexes) is not near the computation. So, if we want to store the
Lucene indexes in the Oak repository in binary fields, how will they
ever be 'near' the computation?

OTOH, I must be missing something, because I expressed these concerns
to Jukka before, so he must know something that I don't if he is still
confident this will work :)

The only way I can imagine that we gain a lot compared to Jackrabbit
2.x and still have performance is if the backing storage contains the
indexes and maintains them (for example, indexing new nodes), just as
Jukka suggests, while repository (JVM) instances load the entire index
nodes from the repository to the local FS. If the repository index is
an append-only binary (for example, only appending new binary segments
to an index as new binaries, just like Lucene does), then perhaps it
could perform.

<snip/>
>
> When executing a query, the search engine in Oak would then detect all
> indexes along the main path axis of a given query. For example, when
> querying for content inside /data/foo, the search engine would use the
> indexes at / and /data, but not the ones at /articles.

And here I have my other doubts. For example, Lucene needs the same
analyzers at query time as were used at indexing time. Now, if I have
an English spellchecker for the index at / and a French one for the
index at /data, then I cannot see how you could ever query both
indexes in one go. Similarly, suppose the index at / indexes the title
property as String (single token) and the index at /data indexes the
title as Text (tokenized). How can you now query the title at /?

So, I do think it is nice to be able to configure multiple index
configurations for different parts of the JCR tree, but I have doubts
about supporting nested indexes that are backed by different index
configurations. Without the nesting, I think it would work. Thus, a
query for / uses the index for /.
A query for /data uses just the index for /data, not the one from /.

These are my concerns. Unfortunately I cannot join the upcoming Oak
hackathon due to holiday, but otherwise I would have been very
interested in the details I don't understand.

Regards Ard

>
> Removing a custom index would be a simple matter of removing the
> respective index configuration node. For example, to remove the full
> text index defined above, one would do:
>
>     Session session = ...;
>     session.getNode("/data/oak:indexes/fulltext").remove();
>     session.save();
>
> BR,
>
> Jukka Zitting
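
PS: To make the String versus Text concern above a bit more concrete,
here is a minimal sketch in plain Java. It does not use the Lucene API;
the class and method names (AnalyzerMismatchSketch, indexAsString,
indexAsText, matches) and the sample title are hypothetical stand-ins I
made up for illustration. It only shows that the same title value,
indexed as one token at / and tokenized at /data, answers the same term
query differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: plain Java stand-ins for analyzers,
// not the Lucene API and not Oak code.
final class AnalyzerMismatchSketch {

    // "String" style indexing: the whole value becomes one lower-cased term.
    static Set<String> indexAsString(String value) {
        Set<String> terms = new HashSet<>();
        terms.add(value.toLowerCase(Locale.ROOT));
        return terms;
    }

    // "Text" style indexing: the value is split on whitespace into lower-cased terms.
    static Set<String> indexAsText(String value) {
        String[] tokens = value.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return new HashSet<>(Arrays.asList(tokens));
    }

    // A term query matches only if exactly that (lower-cased) term was indexed.
    static boolean matches(Set<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String title = "Oak Index Design";            // made-up sample value
        Set<String> rootIndex = indexAsString(title); // assumed config of the index at /
        Set<String> dataIndex = indexAsText(title);   // assumed config of the index at /data

        System.out.println(matches(rootIndex, "oak"));              // false
        System.out.println(matches(dataIndex, "oak"));              // true
        System.out.println(matches(rootIndex, "Oak Index Design")); // true
        System.out.println(matches(dataIndex, "Oak Index Design")); // false
    }
}
```

No single way of preparing the query term gives the same answer from
both term sets, which is why I doubt nested indexes with different
configurations can be queried in one go.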
