full text search improvements

Lukas Kahwe Smith Sat, 24 Mar 2012 07:13:17 -0700

Hi,

I am not a Jackrabbit developer but a very interested user and co-lead of the 
PHPCR [1] initiative.
I wanted to expand partially on what Ard said about potentially looking into 
hooking in Solr/ElasticSearch [2], and also on some other issues I see with 
full text search in Jackrabbit 2.x.


1) scaling

First up, I am overall quite happy with the scalability of Jackrabbit 2.x.
Obviously, though, there are two places where at some point we need to support 
sharding: the persistence manager (which seems to be covered in the current 
Oak plans) and the Lucene index (which doesn't seem to be covered). Imho there 
are already two perfectly fine projects working on this: Solr (the more 
natural choice since it's also an Apache project) and ElasticSearch (imho it 
provides a much better API).

Moreover, (optionally) leveraging these has several other advantages:
- mature products (ElasticSearch especially is very mature when it comes to 
sharding); supporting them might also attract new users to Jackrabbit
- handle much larger data sets via sharding
- provide many more full text search specific features
- less pressure on Jackrabbit to support these features [3] [4]
- as both are Lucene based, the amount of code needed (for example to 
convert QOM to Solr/ElasticSearch) will be minimal
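
To make the last bullet a bit more concrete, here is a rough Python sketch of 
what such a QOM conversion could look like. It is only an illustration under 
assumptions: the Comparison, FullTextSearch and And classes are hypothetical, 
simplified stand-ins for QOM constraints (not the real javax.jcr.query.qom 
API), and mapping JCR property names directly onto index field names is 
assumed. The output uses the ElasticSearch query DSL building blocks "term", 
"match" and "bool".

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical, simplified stand-ins for QOM constraints.
@dataclass
class Comparison:
    prop: str      # property name, assumed to equal the index field name
    value: str     # exact value to compare against


@dataclass
class FullTextSearch:
    prop: str        # property to search in
    expression: str  # full text search expression


@dataclass
class And:
    left: "Constraint"
    right: "Constraint"


Constraint = Union[Comparison, FullTextSearch, And]


def to_es_query(constraint: Constraint) -> dict:
    """Translate a constraint tree into an ElasticSearch query DSL dict."""
    if isinstance(constraint, Comparison):
        # Exact match on an untokenized value.
        return {"term": {constraint.prop: constraint.value}}
    if isinstance(constraint, FullTextSearch):
        # Analyzed full text match.
        return {"match": {constraint.prop: constraint.expression}}
    if isinstance(constraint, And):
        # Both sides must match.
        return {"bool": {"must": [to_es_query(constraint.left),
                                  to_es_query(constraint.right)]}}
    raise TypeError(f"unsupported constraint: {constraint!r}")


query = to_es_query(And(Comparison("jcr:primaryType", "nt:file"),
                        FullTextSearch("jcr:title", "jackrabbit")))
```

A real converter would of course also have to cover joins, path constraints 
and ordering, which this sketch ignores.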

---

2) faceting

I mentioned faceting [4] above. Right now Jackrabbit does not even support 
COUNT() [5], which I find very painful and a major oversight. But what people 
have really come to expect from full text search is faceting. Imho it's so 
important that it should even be part of JCR 2.1 [6], and as you can see in 
that link, the HippoCMS developers seem to agree that it's a very useful 
feature to have inside Jackrabbit.
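
For anyone not familiar with the term, this is roughly what faceting computes: 
for a given result set, a count per distinct value of some field. The Python 
sketch below is purely illustrative; the result documents are hypothetical 
plain dicts and facet_counts is a made-up helper, not a Jackrabbit, Solr or 
ElasticSearch API. A search engine would compute this inside the index rather 
than by iterating over the hits.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many result documents carry each value of `field`."""
    counts = Counter()
    for doc in results:
        value = doc.get(field)
        if value is None:
            continue  # document has no such field
        # Treat single-value and multi-value fields the same way.
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts


# Hypothetical search hits, flattened to field/value pairs.
hits = [
    {"title": "Intro", "tags": ["jcr", "php"]},
    {"title": "Sharding", "tags": ["jcr"]},
    {"title": "Untagged"},
]
tag_facet = facet_counts(hits, "tags")
```

Note that a plain COUNT() would be the degenerate case of this: one total for 
the whole result set instead of one count per value.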

---

3) "cleaner" data in results

This is actually a fairly trivial issue, but one with severe implications for 
scalability. As Ard explained, in many cases "a document" will span many 
nodes. When dealing with such a "document" (especially when building overview 
pages for a collection of documents) it's not always necessary to get the 
entire tree of nodes. All that is needed are some fields. For this the full 
text search API could provide a much faster retrieval mechanism. However, we 
have found that the data stored inside the Lucene index is not the original 
data. It probably makes sense to only store the tokenized version to limit the 
impact of the issue noted in 1), but the fact that the same separator is used 
for spaces and multi value fields [7] makes it needlessly hard in many cases 
to simply leverage the full text search API to fetch subsets of data from a 
tree of nodes.
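
A small Python sketch of why the shared separator hurts. This is not 
Jackrabbit's actual indexing code; flatten_lossy and flatten_reversible are 
hypothetical helpers, and the dedicated separator character is just an 
assumption to show the contrast.

```python
def flatten_lossy(values, sep=" "):
    """Join multi value data with the same character that separates words."""
    return sep.join(values)


def flatten_reversible(values, sep="\x1f"):
    """Join with a dedicated separator (here: ASCII unit separator)."""
    return sep.join(values)


# Two different multi value properties...
cities_a = ["New York", "Paris"]
cities_b = ["New", "York Paris"]

# ...become indistinguishable once flattened with a space.
stored_a = flatten_lossy(cities_a)
stored_b = flatten_lossy(cities_b)
same_after_flattening = stored_a == stored_b

# With a dedicated separator the original values can be recovered.
recovered = flatten_reversible(cities_a).split("\x1f")
```

In the lossy case there is no way to tell from the stored string where one 
value ends and the next begins, so the original values have to be fetched from 
the nodes again.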

---

4) cover more SQL2 functions

This is a comparatively minor topic and might just be beyond the scope of this 
mailing list, which seems to be more about designing the future architecture 
than about "minor" feature requests. But it would be great to also support 
PATH(), DEPTH() etc. [8].

---

One last comment: I hope that all of you see the potential in growing 
Jackrabbit's user base through the existence of PHPCR. Suddenly it becomes a 
highly scalable database for the entire PHP CMS community. As a matter of fact, 
at DrupalCon Denver this week Drupal tentatively agreed to migrate their 
storage API to PHPCR. This doesn't necessarily need to be limited to PHP 
either; PHPCR just proved that JCR isn't as language specific as many 
proponents of CMIS make it out to be. Heck, there is even someone who started 
to port JCR to Node.js [9] (well, it's not very active, but hey).

My point here being: when thinking about Oak, please also think about the 
performance for users talking to Jackrabbit via HTTP. The PHPCR team has done 
its best to solve quite a few performance issues with the current HTTP API, 
but it would be great if this was really on everyone's mind.

regards,
Lukas Kahwe Smith
[email protected]

[1] http://phpcr.github.com
[2] http://www.mail-archive.com/[email protected]/msg00337.html
[3] https://issues.apache.org/jira/browse/JCR-3204
[4] https://issues.apache.org/jira/browse/JCR-3134
[5] https://issues.apache.org/jira/browse/JCR-2605
[6] http://java.net/projects/jsr-333/lists/dev/archive/2011-12/message/3
[7] https://issues.apache.org/jira/browse/JCR-3028
[8] https://issues.apache.org/jira/browse/JCR-3145
[9] https://github.com/NoCR/NoCR
