2014-12-08 8:15 GMT+01:00 Thomas Mueller <[email protected]>:
> Hi,
>
> I think we should do:
>
>
> > 1. conservative approach, do not touch JCR API
>
>
> > select [jcr:path], [facet(jcr:primaryType)] from [nt:base]
> > where contains([text], 'oak');
>
> The column "facet(jcr:primaryType)" would return the facet data. I think
> that's a good approach. The question is, which rows would return that
> data. I would prefer a solution where _each_ row returns the data (and not
> just the first row), because that's a bit easier to use, easier to
> document, and more closely matches the relational model. If just the first
> row returns the facet data, then we can't sort the result afterwards
> (otherwise the facet data ends up in another row, which would be weird).
>
Sure, I see this point. The first implementation that Thomas and I
discussed offline didn't do that, but the current PoC does exactly that (it
can return the facets via row.getValue("facet(jcr:primaryType)") for each
row).
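
As a rough sketch of how a client could consume that column: the snippet
below parses the facet string into a map of counts. It assumes the string
format shown in my earlier mail (label, then a bracketed list of
"value (count)" entries); FacetColumnParser is a hypothetical helper name,
not something that exists in Oak or in the JCR API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical client-side helper; not part of Oak or the JCR API. */
final class FacetColumnParser {

    // One "value (count)" entry; the whitespace before "(" is optional.
    private static final Pattern ENTRY =
            Pattern.compile("([^,\\[\\]()]+?)\\s*\\((\\d+)\\)");

    /**
     * Parses a string such as
     * "jcr:primaryType:[nt:unstructured (100), nt:file (20)]"
     * into an ordered map of facet value to count.
     * The format is an assumption taken from the PoC example.
     */
    static Map<String, Integer> parse(String facetString) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int open = facetString.indexOf('[');
        int close = facetString.lastIndexOf(']');
        if (open < 0 || close < open) {
            return counts; // not in the assumed format: no facets
        }
        Matcher m = ENTRY.matcher(facetString.substring(open + 1, close));
        while (m.find()) {
            counts.put(m.group(1).trim(), Integer.parseInt(m.group(2)));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(parse(
            "jcr:primaryType:[nt:unstructured (100), nt:file (20), oak:Unstructured(10)]"));
    }
}
```

Since every row carries the same facet data, a client would only need to
parse the column of one row, whichever row that is after sorting.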
>
> Another approach is to extend the API (create a new interface, for example
> OakQuery). The JDBC API (but not the JCR API) has a concept of multiple
> result sets per query (Statement.getMoreResults). We could build a
> solution that more closely matches this model. But I don't think it's
> worth the trouble right now (we could still do that later on if really
> needed).
>
I think that what you propose would really make it nicer for an end user
to leverage facets easily. Of course there's no hurry in defining that, at
least until we have a satisfactory facets implementation.
>
> About security, I wonder what are the common configurations. I think we
> should avoid a complex (but slow, and hard to implement) solution that can
> solve 100% of all possible _theoretical_ cases, but instead go for a
> (faster, simpler) solution that covers 99% of all _practical_ cases.
>
If I think of the simplest use cases, I see:
- a publicly available website where users can search without logging in
- a website where logged-in users can search on some content
Both would require the results and facets to be filtered on the content that
a logged-in user or an "anonymous" user can see.
Perhaps we may also have a use case where the website exposes content
crawled from the Web (e.g. Google) with no filtering on content,
maybe just a personalized ranking (but that's a different story that
doesn't belong here).
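
To make the "filter results and facets on what the user can see"
requirement concrete, here is a minimal sketch of counting facet values
while iterating over the results and skipping unreadable nodes. Everything
in it is an assumption for illustration: Hit stands in for a result row,
and the canRead predicate stands in for whatever ACL check the repository
would perform.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Illustrative only; the names here are hypothetical, not Oak API. */
final class FilteredFacetCounter {

    /** Minimal stand-in for a result row: node path plus faceted value. */
    static final class Hit {
        final String path;
        final String facetValue;

        Hit(String path, String facetValue) {
            this.path = path;
            this.facetValue = facetValue;
        }
    }

    /**
     * Counts facet values over the hits the user may read.
     * canRead is a placeholder for the real access check.
     */
    static Map<String, Integer> count(Iterable<Hit> hits,
                                      Predicate<String> canRead) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Hit hit : hits) {
            if (canRead.test(hit.path)) {
                // Increment the count for this facet value, starting at 1.
                counts.merge(hit.facetValue, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

This gives counts that are consistent with the visible results, but it
has to touch every hit, which is exactly the cost we were worried about
for large result sets.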
@Micheal, Laurie: as for filtering out the counts, as I said I'd prefer not
to do that because it's an interesting piece of information we would lose.
What we could do is make that inclusion/exclusion configurable, either in
the query index definition node or at runtime somehow within the query,
depending on the client's needs.
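
A possible shape for that switch, purely as a sketch: keep only the facet
values that occur in at least one visible result, and decide with a flag
whether the (unfiltered) index counts are shown next to them. The class
name, the method and the includeCounts flag are all hypothetical; nothing
like this exists in the PoC yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of configurable count inclusion; not Oak API. */
final class FacetVisibility {

    /**
     * @param indexCounts   facet counts as returned by the index (unfiltered)
     * @param visibleValues facet values seen in at least one readable result
     * @param includeCounts whether to append the unfiltered index counts
     * @return facet labels in the iteration order of indexCounts
     */
    static List<String> restrict(Map<String, Integer> indexCounts,
                                 Set<String> visibleValues,
                                 boolean includeCounts) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Integer> e : indexCounts.entrySet()) {
            if (!visibleValues.contains(e.getKey())) {
                continue; // no readable result falls under this facet value
            }
            labels.add(includeCounts
                    ? e.getKey() + " (" + e.getValue() + ")"
                    : e.getKey());
        }
        return labels;
    }
}
```

Note that with includeCounts set to true the numbers are still the
unfiltered ones from the index, so they may be higher than what the user
can actually read.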
@Laurie: option #5 would mean having query indexes which can index and
query only the data a configured user can see. E.g. we would have an
'anonymous-lucene' index, a Lucene index that will only be able to
index nodes the user "anonymous" can see (has jcr:read privilege on) and
that will be used only for queries issued by the user "anonymous". However,
as I said, I am not sure that's a good idea, because it may not scale (if
you want to define 100 users, you would have 100 Lucene indexes dedicated
to 100 different users).
Regards,
Tommaso
>
>
> Regards,
> Thomas
>
>
>
>
> On 05/12/14 12:13, "Tommaso Teofili" <[email protected]> wrote:
>
> >Hi all,
> >
> >I am resurrecting this thread as I've managed to find some time to start
> >having a look at how to support faceting in Oak query engine.
> >
> >One important thing is that I agree with Ard (and I've seen it like that
> >from the beginning) that since we have Lucene and Solr Oak index
> >implementations we should rely on them for such advanced features [1][2]
> >instead of reinventing the wheel.
> >
> >Given the above assumption, the implementation seems quite
> >straightforward.
> >The not-so-obvious bits come when getting to:
> >- exposing facets within the JCR API
> >- correctly filtering facets depending on authorization / privileges
> >
> >For the former, here is a quick list of options that came to my mind
> >(originated also when talking to people f2f about this):
> >1. conservative approach, do not touch JCR API: facets are retrieved as
> >custom columns (String values) of a Row (from QueryResult.getRows()), e.g.
> >row.getValue("facet(jcr:primaryType)").
> >2. Oak-only approach, do not touch JCR API but provide utilities which can
> >retrieve structured facets from the result, e.g. Iterable<Facet> facets =
> >OakQueryUtils.extractFacets(QueryResult.getRows());
> >3. not JCR compliant approach, we add methods to the API similarly to what
> >Ard and AlexK proposed
> >4. adapter pattern, similarly to what is done in Apache Sling's adaptTo,
> >where QueryResult can be adapted to different things and therefore it's
> >more extensible (but less controllable).
> >Of course other proposals are welcome on this.
> >
> >For the latter the things seem less simple as I foresee that we want the
> >facets to be consistent with the result nodes and therefore to be filtered
> >according to the privileges of the user having issued the query.
> >Here are the options I could think of so far, even though none looks
> >satisfactory to me yet:
> >
> >1. retrieve facets and then filter them afterwards. This seems to have an
> >inherent issue because the facets do not include information about the
> >documents (nodes) which generated them. Retrieving them unfiltered (as
> >the index doesn't have information about ACLs), e.g. a facet on
> >jcr:primaryType:
> >
> >"jcr:primaryType" : {
> > "nt:unstructured" : 100,
> > "nt:file" : 20,
> > "oak:Unstructured" : 10
> >}
> >
> >would require either iterating over the results and filtering the counts
> >as you iterate, or doing N further queries to filter the counts (but then
> >it would be useless to have the facets returned from the index, as we'd
> >be retrieving them ourselves to do the ACL checks), or other such naive
> >methods.
> >
> >2. retrieve the facets unfiltered from the index and then return them in
> >the filtered results only if there's at least one item (node) in the
> >(filtered) results which falls under that facet. That would mean that we
> >would not return the counts of the facets, but a facet would be returned
> >if there's at least one item in the results belonging to it. While it
> >doesn't sound too nice (and it's a pity, as we're losing some information
> >we have along the way), Amazon does exactly that (see the "Show results
> >for" column on the left at [3]) :-)
> >
> >3. use a slightly different mechanism for returning facets, called result
> >grouping (or field collapsing) in Solr [4], in which results are returned
> >grouped (and counted) by a certain field. The example of point 1 would
> >look like:
> >
> >"grouped":{
> >  "jcr:primaryType":{
> >    "matches": 130,
> >    "groups":[{
> >        "groupValue":"nt:unstructured",
> >        "doclist":{"numFound":100,"start":0,"docs":[
> >            {
> >              "path":"/content/a/b"
> >            }, ...
> >          ]
> >        }},
> >      {
> >        "groupValue":"nt:file",
> >        "doclist":{"numFound":20,"start":0,"docs":[
> >            {
> >              "path":"/content/d/e"
> >            }, ...
> >          ]
> >        }},
> >...
> >
> >there the facets would also contain (some or all of) the docs (nodes)
> >belonging to each group and therefore filtering the facets afterwards
> >could be done without having to retrieve the paths of the nodes falling
> >under each facet.
> >
> >4. move towards the 'covering index' concept [5] Thomas mentioned in [6]
> >and incorporate the ACLs in the index so that no further filtering has to
> >be done once the underlying query index has returned its results. However
> >this comes with a non trivial impact with regards to a) load of the
> >indexing on the repo (each time some ACL changes a bunch of index updates
> >happen) b) complexity in encoding ACLs in the indexed documents c)
> >complexity in encoding the ACL check in the index-specific queries. Still,
> >this is probably something we may evaluate regardless of facets in the
> >future, as the lazy ACL check approach we have has, IIUC, the following
> >issue: userA searches for 'jcr:title=foo', the query engine selects the
> >Lucene property index, which returns 100 docs, and userA can see only 2 of
> >them because of its ACLs. In this case we have wasted (approximately) 98%
> >of the Lucene effort to match and return the documents. However, this is
> >most probably overkill for now...
> >
> >5. another probably crazy idea is user filtered indexes, meaning that the
> >NodeStates passed to such IndexEditors would be filtered according to what
> >a configured user (list) can see / read. The obvious disadvantage is the
> >possible proliferation of such indexes and the consequent repository growth.
> >
> >6. at query time map the user ACLs to a list of (readable) paths. Since
> >both Lucene and Solr index implementations index the exact path of each
> >node, such a list may be passed as a "filter query" to be used to find the
> >subset of nodes that such a user can see, and therefore the results
> >(facets included) would come already filtered. The questions here are: a)
> >is it possible to do this mapping at all? b) how slow would it be? Also,
> >the implementation of that would probably require encoding the paths in a
> >way that makes them shorter, possibly as numbers, so that search would be
> >faster.
> >
> >Again, other proposals on authorization are welcome and I'll keep thinking
> >about / inspecting other approaches too.
> >
> >Thanks to the brave who could read so far. After so many words, here is a
> >bit of code from the first PoC of facets based on the Solr index [7]:
> >facets there are not filtered by ACLs and are returned as columns in
> >QueryResult.getRows() (the conservative JCR API option seemed better to
> >start with).
> >A sample query to retrieve facets would be:
> >
> >select [jcr:path], [facet(jcr:primaryType)] from [nt:base] where
> >contains([text], 'oak');
> >
> >and the result could be read like this:
> >
> >RowIterator rows = query.execute().getRows();
> >while (rows.hasNext()) {
> >  Row row = rows.nextRow();
> >  // e.g. jcr:primaryType:[nt:unstructured (100), nt:file (20), oak:Unstructured(10)]
> >  String facetString = row.getValue("facet(jcr:primaryType)").getString();
> >  ...
> >}
> >
> >Looking forward to your comments on the mentioned approaches (and code,
> >eventually).
> >Regards,
> >Tommaso
> >
> >[1] : http://lucene.apache.org/core/4_10_2/facet/org/apache/lucene/facet/package-summary.html
> >[2] : https://cwiki.apache.org/confluence/display/solr/Faceting
> >[3] : http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=sony
> >[4] : https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> >[5] : http://en.wikipedia.org/wiki/Database_index#Covering_index
> >[6] : http://markmail.org/message/4i5d55235oo26okl
> >[7] : https://github.com/tteofili/jackrabbit-oak/compare/oak-1736a#files_bucket
> >
> >2014-09-01 13:11 GMT+02:00 Ard Schrijvers <[email protected]>:
> >
> >> Hey Alex,
> >>
> >> On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
> >> <[email protected]> wrote:
> >> > On 29.08.2014, at 03:10, Ard Schrijvers <[email protected]>
> >> wrote:
> >> >
> >> >> 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
> >> >> layers any more to expose them over pure JCR spec API's. Instead, we
> >> >> would extend the jcr QueryResult to have next to getRows/getNodes/etc
> >> >> also expose for example methods on the QueryResult like
> >> >>
> >> >> public Map<String, Integer> getFacetValues(final String facet) {
> >> >> return result.getFacetValues(facet);
> >> >> }
> >> >>
> >> >> public QueryResult drilldown(final FacetValue facetValue) {
> >> >> // return current query result drilled down for facet value
> >> >> return ...
> >> >> }
> >> >
> >> > We actually have a similar API in our CQ/AEM product:
> >> >
> >> > Query => represents a query [1]
> >> > SearchResult result = query.getResult();
> >> > Map<String, Facet> facets = result.getFacets();
> >> >
> >> > A facet is a list of "Buckets" [2] - same as FacetValue above, I assume
> >> > - an abstraction over different values. You could have distinct values
> >> > (e.g. "red", "green", "blue"), but also ranges ("last year", "last month"
> >> > etc.). Each bucket has a count, i.e. the number of times it occurs in the
> >> > current result.
> >> >
> >> > Then on Query you have a method
> >> >
> >> > Query refine(Bucket bucket)
> >> >
> >> > which is the same as the drilldown above.
> >> >
> >> > So in the end it looks pretty much the same, and seems to be a good way
> >> > to represent this as API. Doesn't say much about the implementation yet,
> >> > though :)
> >>
> >> It looks very much the same, and I must admit that while typing my
> >> mail I didn't pay too much attention to things like naming
> >> (I reckon that #refine is a much nicer name than the
> >> drilldown I wrote :-)
> >>
> >> >
> >> >> 2) Authorized counts....for faceting, it doesn't make sense to expose
> >> >> there are 314 results if you can only read 54 of them. Accounting for
> >> >> authorization through access manager can be way too slow.
> >> >> ...
> >> >> 3) If you support faceting through Oak, will that be competitive
> >> >> enough to what Solr and Elasticsearch offer? Customers these days have
> >> >> some expectations on search result quality and faceting capabilities,
> >> >> performance included.
> >> >> ...
> >> >> So, my take would be to invest time in easy integration with
> >> >> solr/elasticsearch and focus in Oak on the parts (hierarchy,
> >> >> authorization, merging, versioning) that aren't covered by already
> >> >> existing frameworks. Perhaps provide an extended JCR API as described
> >> >> in (1) which under the hood can delegate to a solr or es java client.
> >> >> In the end, you'll still end up having the authorized counts issue,
> >> >> but if you make the integration pluggable enough, it might be possible
> >> >> to leverage domain specific solutions to this (solr/es doesn't do
> >> >> anything with authorization either, it is a tough nut to crack)
> >> >
> >> > Good points. When facets are used, the worst case (showing facets for
> >> > all your content) might actually be the very first thing you see, when
> >> > something like a product search/browse page is shown, before any actual
> >> > search by the user is done. Optimizing for performance right from the
> >> > start is a must, I agree.
> >> >
> >> > What I can imagine, though, is that you can leverage some kind of
> >> > caching. In practice, if you have a public site with content that does
> >> > not change constantly, the facet values are pretty much stable, and
> >> > authorization shouldn't cost much.
> >>
> >> Certainly there are many use cases where you can cache a lot, or for
> >> example have a public site user that has read access to an entire
> >> content tree. It becomes however much more difficult when you want to
> >> for example expose faceted structure of documents to an editor in a
> >> cms environment, where the editor has read access to only 1% of the
> >> documents. If at the same time, her initial query without
> >> authorization results in, say, 10 million hits, then you'll have to
> >> authorize all of them to get correct counts. The only way we could
> >> make this perform with Hippo CMS against jackrabbit was by
> >> translating our authorization model directly to lucene
> >> queries and keep caching (authorized) bitsets (slightly different in
> >> newer lucene versions) in memory for a user, see [1]. The difficulty
> >> was that even executing the authorization query (to AND with normal
> >> query) became slow because of very large queries, but fortunately due
> >> to the jackrabbit 2 index implementation, we could keep a cached
> >> bitset per indexreader, see [2]. Unfortunately, this solution can only
> >> be done for specific authorization models (which can be mapped to
> >> lucene queries) and might not be generic enough for oak.
> >>
> >> Any way, apart from performance / authorization, I doubt whether oak
> >> will be able to keep up with what can be leveraged through ES or Solr.
> >>
> >> Regards Ard
> >>
> >> [1] http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/src/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java
> >> [2] http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html
> >>
> >> >
> >> > [1] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html
> >> > [2] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html
> >> >
> >> > Cheers,
> >> > Alex
> >>
> >>
> >>
> >>
>
>