On Thu, Mar 22, 2012 at 9:36 AM, Thomas Mueller <[email protected]> wrote:
> Hi,
>
>>OAK-36 covers the Query implementation effort, but I'm wondering if now
>>would be a good time to mention indexing as well.
>>
>>We want to have dedicated indexes, I think that would be accomplished via
>>observation.
>>Any ideas about the availability of this feature?
>
> Sure. One such a mechanism is implemented, and currently lives under
> org.apache.jackrabbit.mk.index. It is not yet "wired" to
> org.apache.jackrabbit.oak.query.index. This mechanism stores the index
> data in nodes and properties, as a tree (using just the MicroKernel API).
> This mechanism is supposed to be as scalable as the MicroKernel
> implementation (support concurrent writes if the MicroKernel
> implementation supports it).
>
>>The current index implementation just traverses the existing nodes (albeit
>>applying some path constraints first),
>
> Yes, that's org.apache.jackrabbit.oak.query.index.TraversingReader
>
>>This helps with testing the query parser & friends, but a lucene based
>>query engine needs events to update its data.
>
> Given the scalability requirements defined at [1] (specially concurrent,
> scalable writes in multiple cluster nodes) we plan to support other
> (non-Lucene) index mechanisms as well. Personally, I believe we should use
> Lucene for fulltext indexing, because that's what Lucene is meant for. But
> I'm not sure how a fully scalable fulltext index using Lucene would look
> like. That's still an open question we need to resolve, or define the
> limitations in this area.

I'd opt for not implementing a fulltext search index in the repository at all, but rather for providing some good places to hook in an 'external' index. I should have written up my/our (Hippo) use cases in a mail already, but never got to it.

I've come to believe that free text search / full text indexing is too domain specific to be captured in a generic one-size-fits-all solution. Imo, full text indexing is very much tied to how your 'domain model' is mapped to jcr nodes. A generic repository full text index will index jcr nodes, while, for example, at Hippo we are interested in indexing 'documents': a document can be a small bonsai tree of nodes. I know attempts have been made at indexing_configuration kinds of tuning, but, imho, it just does not work that well.

Also, the jr indexes are quite inefficient in general. In our case, for just a couple of hundred thousand documents, the number of jcr nodes easily exceeds many millions: the (Lucene / full text) indexes are much bigger than needed. For the current jr 2 indexes, pretty much every string property also gets stored in the index in order to do an 'equals'. If for oak the equality checks are done against a different index (a node index) instead of Lucene, it will be very hard to combine the results.

Although I am on thin ice here, I think there are hardly any NoSQL stores out there that actually include full text indexes. I think we shouldn't try to address it in the repository, but rather provide some tooling to easily set up an (external) full text index (like plain Lucene, or Solr/Elasticsearch) according to someone's exact needs (like which analyzer to use for which part of the content, which properties should be stored, which properties should be analyzed in which ways, which properties are meant for TrieRanges, etc.)

Regards Ard

>
> [1]:
> http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrabbit%203
>
> Regards,
> Thomas
>
