Hi all,
I am resurrecting this thread as I've managed to find some time to start
looking at how to support faceting in the Oak query engine.
One important thing is that I agree with Ard (and I've seen it like that
from the beginning) that since we have Lucene and Solr Oak index
implementations we should rely on them for such advanced features [1][2]
instead of reinventing the wheel.
Under the above assumption, the implementation seems quite straightforward.
The not-so-obvious bits come when getting to:
- exposing facets within the JCR API
- correctly filtering facets depending on authorization / privileges
For the former, here is a quick list of options that came to my mind (some
of them originated when talking to people f2f about this):
1. conservative approach, do not touch the JCR API: facets are retrieved as
custom columns (String values) of a Row (from QueryResult.getRows()), e.g.
row.getValue("facet(jcr:primaryType)").
2. Oak-only approach, do not touch JCR API but provide utilities which can
retrieve structured facets from the result, e.g. Iterable<Facet> facets =
OakQueryUtils.extractFacets(QueryResult.getRows());
3. non-JCR-compliant approach: we add methods to the API, similarly to what
Ard and AlexK proposed
4. adapter pattern, similarly to what is done in Apache Sling's adaptTo,
where QueryResult can be adapted to different things and therefore it's
more extensible (but less controllable).
Of course other proposals are welcome on this.
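To make option 2 a bit more concrete, here is a minimal sketch of a utility
that parses the facet string returned in a Row column (the string format is
the one from the PoC example further down in this mail; the FacetParser class
itself is hypothetical, not existing Oak code) into structured counts:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical utility for option 2: parse a facet column value like
// "jcr:primaryType:[nt:unstructured (100), nt:file (20), oak:Unstructured (10)]"
// (the string format of the PoC) into a map of facet value -> count.
public class FacetParser {

    public static Map<String, Integer> parse(String facetString) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int open = facetString.indexOf('[');
        int close = facetString.lastIndexOf(']');
        if (open < 0 || close < open) {
            return counts; // not a facet string
        }
        for (String entry : facetString.substring(open + 1, close).split(",")) {
            entry = entry.trim();
            int paren = entry.lastIndexOf('(');
            if (paren > 0 && entry.endsWith(")")) {
                String value = entry.substring(0, paren).trim();
                int count = Integer.parseInt(
                        entry.substring(paren + 1, entry.length() - 1));
                counts.put(value, count);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> facets = parse(
            "jcr:primaryType:[nt:unstructured (100), nt:file (20), oak:Unstructured (10)]");
        System.out.println(facets);
    }
}
```

Option 4 could then wrap such a utility behind an adapter, so callers never
have to deal with the raw strings at all.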
For the latter, things seem less simple, as I expect that we want the
facets to be consistent with the result nodes and therefore filtered
according to the privileges of the user who issued the query.
Here are the options I could think of so far, even though none looks
satisfactory to me yet:
1. retrieve facets and then filter them afterwards: this seems to have an
inherent issue because the facets do not include information about the
documents (nodes) which generated them. Retrieving them unfiltered (as the
index doesn't have information about ACLs),
e.g. a facet on jcr:primaryType:
"jcr:primaryType" : {
"nt:unstructured" : 100,
"nt:file" : 20,
"oak:Unstructured" : 10
}
would require either iterating over the results and filtering the counts as
you iterate, or doing N further queries to filter the counts; but then it
would be useless to have the facets returned from the index, as we'd be
retrieving them ourselves anyway in order to do the ACL checks.
2. retrieve the facets unfiltered from the index and then return them in
the filtered results only if there's at least one item (node) in the
(filtered) results which falls under that facet. That would mean that we
would not return the counts of the facets, but a facet would be returned if
there's at least one item in the results belonging to it. While it sounds
not too nice (and a pity, as we're losing some information we have along
the way), Amazon does exactly that (see the "Show results for" column on
the left at [3]) :-)
3. use a slightly different mechanism for returning facets, called result
grouping (or field collapsing) in Solr [4], in which results are returned
grouped (and counted) by a certain field. The example of point 1 would look
like:
"grouped":{
"jcr:primaryType":{
"matches": 130,
"groups":[{
"groupValue":"nt:unstructured",
"doclist":{"numFound":100,"start":0,"docs":[
{
"path":"/content/a/b"
}, ...
]
}},
{
"groupValue":"nt:file",
"doclist":{"numFound":20,"start":0,"docs":[
{
"path":"/content/d/e"
}, ...
]
}},
...
There the facets would also contain (some or all of) the docs (nodes)
belonging to each group, and therefore filtering the facets afterwards
could be done without having to retrieve separately the paths of the nodes
falling under each facet.
4. move towards the 'covering index' concept [5] Thomas mentioned in [6]
and incorporate the ACLs in the index, so that no further filtering has to
be done once the underlying query index has returned its results. However
this comes with a non-trivial impact with regard to a) indexing load on the
repo (each time some ACL changes, a bunch of index updates happen), b)
complexity in encoding ACLs in the indexed documents, c) complexity in
encoding the ACL check in the index-specific queries. Still, this is
probably something we may evaluate regardless of facets in the future, as
the lazy ACL check approach we have has, IIUTC, the following issue: if
userA searches for 'jcr:title=foo', the query engine selects the Lucene
property index, which returns 100 docs, and userA is only able to see 2 of
them because of their ACLs, then we have wasted (approximately) 98% of the
Lucene effort spent matching and returning the documents. However this is
most probably overkill for now...
5. another, probably crazy, idea is user-filtered indexes, meaning that the
NodeStates passed to such IndexEditors would be filtered according to what
a configured user (or list of users) can see / read. The obvious
disadvantage is the possible proliferation of such indexes and the
consequent repository growth.
6. at query time, map the user's ACLs to a list of (readable) paths; since
both the Lucene and Solr index implementations index the exact path of each
node, such a list could be passed as a "filter query" to find the subset of
nodes that such a user can see, so that the results (facets included) would
come back already filtered. The questions here are: a) is it possible to do
this mapping at all? b) how slow would it be? The implementation would
probably also require encoding the paths in a shorter form, possibly as
numbers, so that the search is faster.
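As an illustration of option 6, here is a minimal sketch that maps a list
of readable paths to a Solr-style filter query. The field name "path", the
escaping, and the descendant wildcard are all assumptions for illustration,
not the actual Oak Solr index schema:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of option 6: turn the list of paths a user can read
// into a Solr filter query ("fq") restricting results (and facet counts)
// to those paths and their descendants. The "path" field name and the
// wildcard syntax are assumptions about the index schema.
public class AclFilterQuery {

    public static String buildFilterQuery(List<String> readablePaths) {
        StringBuilder fq = new StringBuilder();
        for (String path : readablePaths) {
            if (fq.length() > 0) {
                fq.append(" OR ");
            }
            // match the path itself and everything below it
            fq.append("path:").append(escape(path))
              .append(" OR path:").append(escape(path)).append("\\/*");
        }
        return fq.toString();
    }

    private static String escape(String path) {
        // escape Solr query syntax characters occurring in JCR paths
        return path.replace("/", "\\/").replace(":", "\\:");
    }

    public static void main(String[] args) {
        String fq = buildFilterQuery(
                Arrays.asList("/content/site-a", "/home/users/a"));
        System.out.println(fq);
    }
}
```

Whether such an fq stays manageable for a user with thousands of readable
paths is exactly open question b) above; this is also where encoding paths
as shorter tokens or numbers could help.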
Again, other proposals on authorization are welcome, and I'll keep thinking
about / investigating other approaches too.
Thanks to the brave who have read this far; after so many words, here is a
bit of code from the first PoC of facets based on the Solr index [7]:
facets there are not filtered by ACLs and are returned as columns in
QueryResult.getRows() (the conservative JCR API option seemed better to
start with).
A sample query to retrieve facets would be:
select [jcr:path], [facet(jcr:primaryType)] from [nt:base] where
contains([text], 'oak');
and the result would look like:
RowIterator rows = query.execute().getResult().getRows();
while (rows.hasNext()) {
    Row row = rows.nextRow();
    // --> jcr:primaryType:[nt:unstructured (100), nt:file (20), oak:Unstructured (10)]
    String facetString = row.getValue("facet(jcr:primaryType)").getString();
    ...
}
Looking forward to your comments on the mentioned approaches (and code,
eventually).
Regards,
Tommaso
[1] :
http://lucene.apache.org/core/4_10_2/facet/org/apache/lucene/facet/package-summary.html
[2] : https://cwiki.apache.org/confluence/display/solr/Faceting
[3] :
http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=sony
[4] : https://cwiki.apache.org/confluence/display/solr/Result+Grouping
[5] : http://en.wikipedia.org/wiki/Database_index#Covering_index
[6] : http://markmail.org/message/4i5d55235oo26okl
[7] :
https://github.com/tteofili/jackrabbit-oak/compare/oak-1736a#files_bucket
2014-09-01 13:11 GMT+02:00 Ard Schrijvers <[email protected]>:
> Hey Alex,
>
> On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
> <[email protected]> wrote:
> > On 29.08.2014, at 03:10, Ard Schrijvers <[email protected]>
> wrote:
> >
> >> 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
> >> layers any more to expose them over pure JCR spec API's. Instead, we
> >> would extend the jcr QueryResult to have next to getRows/getNodes/etc
> >> also expose for example methods on the QueryResult like
> >>
> >> public Map<String, Integer> getFacetValues(final String facet) {
> >> return result.getFacetValues(facet);
> >> }
> >>
> >> public QueryResult drilldown(final FacetValue facetValue) {
> >> // return current query result drilled down for facet value
> >> return ...
> >> }
> >
> > We actually have a similar API in our CQ/AEM product:
> >
> > Query => represents a query [1]
> > SearchResult result = query.getResult();
> > Map<String, Facet> facets = result.getFacets();
> >
> > A facet is a list of "Buckets" [2] - same as FacetValue above, I assume
> - an abstraction over different values. You could have distinctive values
> (e.g. "red", "green", "blue"), but also ranges ("last year", "last month"
> etc.). Each bucket has a count, i.e. the number of times it occurs in the
> current result.
> >
> > Then on Query you have a method
> >
> > Query refine(Bucket bucket)
> >
> > which is the same as the drilldown above.
> >
> > So in the end it looks pretty much the same, and seems to be a good way
> to represent this as API. Doesn't say much about the implementation yet,
> though :)
>
> It looks very much the same, and I must admit that during typing my
> mail I didn't put too much attention to things like how to name
> something (I reckon that #refine is a much nicer name than the
> drillDown I wrote :-)
>
> >
> >> 2) Authorized counts....for faceting, it doesn't make sense to expose
> >> there are 314 results if you can only read 54 of them. Accounting for
> >> authorization through access manager can be way too slow.
> >> ...
> >> 3) If you support faceting through Oak, will that be competitive
> >> enough to what Solr and Elasticsearch offer? Customers these days have
> >> some expectations on search result quality and faceting capabilities,
> >> performance included.
> >> ...
> >> So, my take would be to invest time in easy integration with
> >> solr/elasticsearch and focus in Oak on the parts (hierarchy,
> >> authorization, merging, versioning) that aren't covered by already
> >> existing frameworks. Perhaps provide an extended JCR API as described
> >> in (1) which under the hood can delegate to a solr or es java client.
> >> In the end, you'll still end up having the authorized counts issue,
> >> but if you make the integration pluggable enough, it might be possible
> >> to leverage domain specific solutions to this (solr/es doesn't do
> >> anything with authorization either, it is a tough nut to crack)
> >
> > Good points. When facets are used, the worst case (showing facets for
> all your content) might actually be the very first thing you see, when
> something like a product search/browse page is shown, before any actual
> search by the user is done. Optimizing for performance right from the start
> is a must, I agree.
> >
> > What I can imagine though, is if you can leverage some kind of caching
> though. In practice, if you have a public site with content that does not
> change permanently, the facet values are pretty much stable, and
> authorization shouldn't cost much.
>
> Certainly there are many use cases where you can cache a lot, or for
> example have a public site user that has read access to an entire
> content tree. It becomes however much more difficult when you want to
> for example expose faceted structure of documents to an editor in a
> cms environment, where the editor has read access to only 1% of the
> documents. If at the same time, her initial query without
> authorization results in, say, 10 million hits, then you'll have to
> authorize all of them to get correct counts. The only way we could
> make this perform well with Hippo CMS against jackrabbit was by
> translating our authorization model directly to lucene queries and
> caching (authorized) bitsets (slightly different in
> newer lucene versions) in memory for a user, see [1]. The difficulty
> was that even executing the authorization query (to AND with normal
> query) became slow because of very large queries, but fortunately due
> to the jackrabbit 2 index implementation, we could keep a cached
> bitset per indexreader, see [2]. Unfortunately, this solution can only
> be done for specific authorization models (which can be mapped to
> lucene queries) and might not be generic enough for oak.
>
> Anyway, apart from performance / authorization, I doubt whether oak
> will be able to keep up with what can be leveraged through ES or Solr.
>
> Regards Ard
>
> [1]
> http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/src/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java
> [2]
> http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html
>
> >
> > [1]
> http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html
> > [2]
> http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html
> >
> > Cheers,
> > Alex
>
>
>
> --
> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> Boston - 1 Broadway, Cambridge, MA 02142
>
> US +1 877 414 4776 (toll free)
> Europe +31(0)20 522 4466
> www.onehippo.com
>