Hi,

On Wed, Sep 19, 2012 at 9:30 PM, Ard Schrijvers <[email protected]> wrote:
> I've read the entire thread, and below reply inline to the initial
> proposal of Jukka as I have some doubts in that area:

Great comments, thanks for joining the discussion!

> The only way I could imagine we already gain a lot compared to jr 2.x
> and still have performance is if we have the backing storage contain
> (and maintain like indexing new nodes) the indexes (just like Jukka
> suggests), but repository (jvm) instances load the entire index nodes
> from the repository to local FS. If the repository index is an append
> only binary (for example append only the binary segments as new
> binaries to an index just like Lucene does) then perhaps it could
> perform

That's the idea. All frequently accessed binaries can and should be
kept locally, which should make the index perform pretty well.

This isn't implemented yet (currently the LuceneIndex simply reads all
index binaries to memory...), so there still is no way to benchmark
the idea in practice. But at least from a design perspective I don't
see any major reasons why this solution couldn't perform at least
reasonably close to what Lucene achieves when directly accessing a
local file system.

> And here I think I have my other doubts. For example, Lucene needs the
> same analyzers query time as were used indexing time. Now, if I would
> have an English spellchecker for the index at / and a French for the
> index at /data, then, I cannot see how you could ever query both
> indexes in one go. Similarly if the index at / indexes title property
> as String (single token) and the index at /data indexes the title as
> Text (tokenized). How can you now query the title at /

The index at / indexes content from the entire tree, also from within
/data. The fact that there's an extra index at /data wouldn't affect
the index at / in any way. Therefore you can still easily query for
title at / in English and get correct results also from within /data.

> So, I do think it is nice to be able to configure multiple index
> configuration for different parts of the jcr tree, but I doubt about
> supporting nested indexes that are backed by different index
> configuration. Without the nesting, I think it would work.

As mentioned above, the idea is not for the indexes to be nested.

(I previously toyed with the idea of a hierarchical map-reduce-like
mechanism for building an index incrementally across the whole tree,
but that's a different discussion and probably won't be implemented
unless there's some particular use case for something like that.)

> Thus, query for / uses the index for /. Query for /data uses just
> the index for /data, not the one from /

The index selection process is a bit more complicated than that.
Basically for each query we'd look up all the potentially applicable
indexes, and then each index is asked to estimate how efficiently it
could execute a given query, for example
/jcr:root/data//*[@title='foo'].

The index at / would notice that it does keep track of the title
property so it can do a property constraint pretty efficiently, but
probably won't be that fast in evaluating the path constraint. The
index at /data on the other hand could do both constraints
efficiently, so the query engine will pick that one.

On the other hand, if the query was about some other property, like
/jcr:root/data//*[@author='bar'], and that property is only indexed at
/, then that index would likely get selected by the query engine over
the one at /data.

BR,

Jukka Zitting
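
PS. To make the selection step described above a bit more concrete,
here is a rough sketch in Python. Everything in it is made up for
illustration: the Index class, the covers / estimate_cost /
select_index names, and the cost numbers are assumptions and not
actual Oak code. A real cost model would be more involved.

```python
from dataclasses import dataclass


def depth(path: str) -> int:
    """Number of path segments; "/" has depth 0, "/data" has depth 1."""
    return len([segment for segment in path.split("/") if segment])


@dataclass(frozen=True)
class Index:
    """Hypothetical index description (not an Oak class)."""
    root: str              # where the index is configured, e.g. "/" or "/data"
    properties: frozenset  # property names this index keeps track of

    def covers(self, path: str) -> bool:
        # An index rooted at R sees all content at or below R.
        if self.root == "/":
            return True
        return path == self.root or path.startswith(self.root + "/")

    def estimate_cost(self, query_path: str, prop: str) -> float:
        """Made-up estimate: lower is better, inf means 'cannot answer'."""
        if not self.covers(query_path) or prop not in self.properties:
            return float("inf")
        cost = 1.0  # the property constraint is answered from the index
        # Assumed penalty: the broader the index is compared to the query
        # path, the more work the path constraint takes as an extra filter.
        cost += 10.0 * (depth(query_path) - depth(self.root))
        return cost


def select_index(indexes, query_path, prop):
    """Ask every index for an estimate and return the cheapest, or None."""
    best = min(indexes, key=lambda ix: ix.estimate_cost(query_path, prop),
               default=None)
    if best is None or best.estimate_cost(query_path, prop) == float("inf"):
        return None
    return best


root_index = Index("/", frozenset({"title", "author"}))
data_index = Index("/data", frozenset({"title"}))
indexes = [root_index, data_index]

# /jcr:root/data//*[@title='foo']: both indexes qualify, /data is cheaper.
title_choice = select_index(indexes, "/data", "title")
# /jcr:root/data//*[@author='bar']: only the index at / tracks author.
author_choice = select_index(indexes, "/data", "author")
```

With these toy numbers the title query ends up at the /data index and
the author query at the / index, which matches the two cases described
above.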
