Hey Alex,

On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
<[email protected]> wrote:
> On 29.08.2014, at 03:10, Ard Schrijvers <[email protected]> wrote:
>
>> 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
>> layers any more to expose them over pure JCR spec API's. Instead, we
>> would extend the jcr QueryResult to have next to getRows/getNodes/etc
>> also expose for example methods on the QueryResult like
>>
>> public Map<String, Integer> getFacetValues(final String facet) {
>>      return result.getFacetValues(facet);
>> }
>>
>> public QueryResult drilldown(final FacetValue facetValue) {
>>        // return current query result drilled down for facet value
>>        return ...
>> }
>
> We actually have a similar API in our CQ/AEM product:
>
> Query => represents a query [1]
> SearchResult result = query.getResult();
> Map<String, Facet> facets = result.getFacets();
>
> A facet is a list of "Buckets" [2] - same as FacetValue above, I assume - an 
> abstraction over different values. You could have distinctive values (e.g. 
> "red", "green", "blue"), but also ranges ("last year", "last month" etc.). 
> Each bucket has a count, i.e. the number of times it occurs in the current 
> result.
>
> Then on Query you have a method
>
> Query refine(Bucket bucket)
>
> which is the same as the drilldown above.
>
> So in the end it looks pretty much the same, and seems to be a good way to 
> represent this as API. Doesn't say much about the implementation yet, though 
> :)

It looks very much the same. I must admit that while typing my mail I
didn't pay much attention to things like naming (I reckon that #refine
is a much nicer name than the drilldown I wrote :-)
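
To make the shape of such an API concrete, here is a minimal in-memory sketch of a faceted result with getFacetValues and refine, roughly as discussed above. All names and the list-of-maps representation are illustrative assumptions; a real implementation would of course delegate to the index rather than count hits in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a faceted query result: getFacetValues returns
// value -> count for the current result, refine narrows the result to
// one facet value (the drilldown/refine discussed above).
final class FacetedResult {
    // each hit is modeled as a map of facet name -> facet value
    private final List<Map<String, String>> hits;

    FacetedResult(List<Map<String, String>> hits) {
        this.hits = hits;
    }

    // facet value -> number of occurrences in the current result
    Map<String, Integer> getFacetValues(String facet) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, String> hit : hits) {
            String value = hit.get(facet);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // return the current result drilled down for one facet value
    FacetedResult refine(String facet, String value) {
        List<Map<String, String>> narrowed = new ArrayList<>();
        for (Map<String, String> hit : hits) {
            if (value.equals(hit.get(facet))) {
                narrowed.add(hit);
            }
        }
        return new FacetedResult(narrowed);
    }

    int size() {
        return hits.size();
    }
}
```

A Bucket abstraction as in the AEM API would generalize the plain String value here to ranges ("last month" etc.) as well.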

>
>> 2) Authorized counts....for faceting, it doesn't make sense to expose
>> there are 314 results if you can only read 54 of them. Accounting for
>> authorization through access manager can be way too slow.
>> ...
>> 3) If you support faceting through Oak, will that be competitive
>> enough to what Solr and Elasticsearch offer? Customers these days have
>> some expectations on search result quality and faceting capabilities,
>> performance included.
>> ...
>> So, my take would be to invest time in easy integration with
>> solr/elasticsearch and focus in Oak on the parts (hierarchy,
>> authorization, merging, versioning) that aren't covered by already
>> existing frameworks. Perhaps provide an extended JCR API as described
>> in (1) which under the hood can delegate to a solr or es java client.
>> In the end, you'll still end up having the authorized counts issue,
>> but if you make the integration pluggable enough, it might be possible
>> to leverage domain specific solutions to this (solr/es doesn't do
>> anything with authorization either, it is a tough nut to crack)
>
> Good points. When facets are used, the worst case (showing facets for all 
> your content) might actually be the very first thing you see, when something 
> like a product search/browse page is shown, before any actual search by the 
> user is done. Optimizing for performance right from the start is a must, I 
> agree.
>
> What I can imagine though, is if you can leverage some kind of caching 
> though. In practice, if you have a public site with content that does not 
> change permanently, the facet values are pretty much stable, and 
> authorization shouldn't cost much.

Certainly there are many use cases where you can cache a lot, or for
example have a public site user with read access to an entire content
tree. It becomes much more difficult, however, when you want to expose
a faceted structure of documents to, for example, an editor in a cms
environment, where the editor has read access to only 1% of the
documents. If at the same time her initial query without authorization
results in, say, 10 million hits, then you'll have to authorize all of
them to get correct counts. The only way we could make this perform
well with Hippo CMS against jackrabbit was by translating our
authorization model directly to lucene queries and caching (authorized)
bitsets (slightly different in newer lucene versions) in memory per
user, see [1]. The difficulty was that even executing the authorization
query (to AND with the normal query) became slow because the queries
got very large, but fortunately, due to the jackrabbit 2 index
implementation, we could keep a cached bitset per indexreader, see [2].
Unfortunately, this solution only works for specific authorization
models (those that can be mapped to lucene queries) and might not be
generic enough for oak.
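
Stripped of the lucene specifics, the cached-bitset idea can be sketched with plain java.util.BitSet: run the (expensive) authorization query once per user per index segment, cache the resulting bitset, and AND it with every query's hits before counting. All names below are hypothetical stand-ins; the real version in [1] evaluates the user's authorization query through lucene and caches per IndexReader.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-user, per-segment authorization bitset
// caching (a stand-in for the per-indexreader cache described in [2]).
final class AuthorizedSearch {
    // stand-in for executing the expensive authorization query once
    interface AuthQuery {
        BitSet readableDocs(int maxDoc);
    }

    // cache key: userId + segment id (an IndexReader in practice)
    private final Map<String, BitSet> authCache = new HashMap<>();

    // compute the authorization bitset once and cache it
    BitSet authorize(String userId, String segmentId, int maxDoc, AuthQuery query) {
        return authCache.computeIfAbsent(userId + "@" + segmentId,
                k -> query.readableDocs(maxDoc));
    }

    // AND the raw query hits with the cached authorization bitset, so
    // facet counts are computed only over documents the user may read
    BitSet authorizedHits(BitSet queryHits, BitSet authorized) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(authorized);
        return result;
    }
}
```

The point of the cache is that the AND itself is cheap even for millions of documents; the expensive part (evaluating the authorization rules) happens once per user per segment instead of once per query.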

Anyway, apart from performance and authorization, I doubt whether oak
will be able to keep up with what can be leveraged through ES or Solr.

Regards Ard

[1] 
http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/src/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java
[2] 
http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html

>
> [1] 
> http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html
> [2] 
> http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html
>
> Cheers,
> Alex



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
