Hi,

I am not a Jackrabbit developer, but a very interested user and co-lead of the PHPCR [1] initiative. I wanted to expand on part of what Ard said about potentially hooking in Solr/ElasticSearch [2], and also raise some other issues I see with full text search in Jackrabbit 2.x.

1) scaling

Now first up, I am overall quite happy with the scalability of Jackrabbit 2.x. Obviously, though, there are two places where at some point we need to support sharding: the persistence manager (which seems to be covered in the current Oak plans) and the Lucene index (which doesn't seem to be covered). Now imho there are already two perfectly fine projects working on this: Solr (the more natural choice, since it's also an Apache project) and ElasticSearch (imho it provides a much better API). Moreover, (optionally) leveraging these has several other advantages:

- mature products (ElasticSearch in particular is very mature when it comes to sharding); supporting them might also attract new users to Jackrabbit
- handle much larger data sets via sharding
- provide many more full text search specific features
- less pressure on Jackrabbit to support these features [3] [4]
- as both are Lucene based, the amount of code needed (for example to convert QOM to Solr/ElasticSearch) will be minimal

2) faceting

Now I mentioned faceting [4] above. Right now Jackrabbit does not even support COUNT() [5], which I find very painful and a major oversight. But really what people have come to expect from full text search is faceting. Imho it's so important that it should even be part of JCR 2.1 [6], and as you can see in this link, it seems like HippoCMS developers agree that it's a very useful feature to have inside Jackrabbit.

3) "cleaner" data in results

This is actually a fairly trivial issue, but one with severe implications for scalability. As Ard explained, in many cases "a document" will span many nodes. Now when dealing with such a "document" (especially when building overview pages for a collection of documents), it's not always necessary to get the entire tree of nodes. All that is needed are some fields. For this the full text search API could provide a much faster retrieval mechanism.

However, we have found that the data stored inside the Lucene index is not the original data. It probably makes sense to only store the tokenized version to limit the impact of the issue noted in 1), but the fact that the same separator is used for spaces and multi value fields [7] makes it needlessly hard in many cases to simply leverage the full text search API to fetch subsets of data from a tree of nodes.

4) cover more SQL2 functions

This is a comparatively minor topic and might just be beyond the scope of this mailing list, which seems to be more about designing the future architecture than about "minor" feature requests. But it would be great to also support PATH(), DEPTH() etc. [8].

Now one last comment: I hope that all of you see the potential that PHPCR offers for growing Jackrabbit's user base. Suddenly it becomes a highly scalable database for the entire PHP CMS community. As a matter of fact, at DrupalCon Denver this week Drupal tentatively agreed to migrate their storage API to PHPCR. Now this doesn't necessarily need to be limited to PHP: PHPCR just proved that JCR isn't as language specific as many proponents of CMIS make it out to be. Heck, there is even someone who started to port JCR to Node.js [9] (well, it's not very active, but hey). My point here is: when thinking about Oak, please also think about the performance for users talking to Jackrabbit via HTTP. The PHPCR team has done its best to solve quite a few performance issues with the current HTTP API, but it would be great if this was really in everyone's head.

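To make the separator problem from 3) a bit more concrete, here is a minimal sketch in Python. It is not Jackrabbit's actual indexing code; the separator, the function names and the sample values are all made up for illustration. It only shows why using one separator both between words and between the values of a multi value field makes the original values unrecoverable.

```python
# Hypothetical illustration only; this is NOT how Jackrabbit builds its index.
# Assumption: a single separator is used both between words inside a value
# and between the values of a multi value property.
SEPARATOR = " "


def flatten(values):
    """Join the values of a multi value property into one stored string."""
    return SEPARATOR.join(values)


def recover(stored):
    """Best-effort attempt to split the stored string back into values."""
    return stored.split(SEPARATOR)


tags_a = ["open source", "cms"]     # property with two values
tags_b = ["open", "source", "cms"]  # property with three values

# Both properties collapse to the same stored string ...
same_stored = flatten(tags_a) == flatten(tags_b)

# ... so the two-value property cannot be reconstructed from the index.
round_trip_ok = recover(flatten(tags_a)) == tags_a
```

With a distinct separator for value boundaries (or by storing the original values alongside the tokens), `recover` could return the original list, which is all that would be needed to fetch a few fields straight from the search results.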
regards,
Lukas Kahwe Smith
[email protected]

[1] http://phpcr.github.com
[2] http://www.mail-archive.com/[email protected]/msg00337.html
[3] https://issues.apache.org/jira/browse/JCR-3204
[4] https://issues.apache.org/jira/browse/JCR-3134
[5] https://issues.apache.org/jira/browse/JCR-2605
[6] http://java.net/projects/jsr-333/lists/dev/archive/2011-12/message/3
[7] https://issues.apache.org/jira/browse/JCR-3028
[8] https://issues.apache.org/jira/browse/JCR-3145
[9] https://github.com/NoCR/NoCR
