On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <[email protected]> wrote:
> On 09/12/2014 17:10, Michael Marth wrote:
>> ...
>>
>> The use cases problematic case for counting the facets I have in mind are
>> when a query returns millions of results. This is problematic when one wants
>> to retrieve the exact size of the result set (taking ACLs into account,
>> obviously). When facets are to be retrieved this will be an even harder
>> problem (meaning when the exact number is to be calculated per facet).
>> As an illustration consider a digital asset management application that
>> displays mime type as facets. A query could return 1 million images and,
>> say, 10 video.
>>
>> Is there a way we could support such scenarios (while still counting results
>> per facet) and have a performant implementation?
>>
> We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
> we're done within it, then we can output the actual number. In case
> after 1000 nodes checked we still have some left we can leave the number
> either empty or with something like "many", "+", or any other fancy way
> if we want.
>
> In the end is the same approach taken by Amazon (as Tommaso already
> pointed) or for example google. If you run a search, their facets
> (Searches related to...) are never with results.
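To make the quoted proposal concrete, here is a minimal Python sketch of capped counting. It is only an illustration, not Oak code: `capped_facet_counts`, `format_count`, `can_read` and `facet_of` are hypothetical names, and the ACL check is assumed to be a plain callable.

```python
from collections import Counter
from itertools import islice

_END = object()  # sentinel used to detect an exhausted result iterator


def capped_facet_counts(results, can_read, facet_of, limit=1000):
    """ACL-check at most `limit` results and count readable ones per facet.

    Returns (counts, exact). `exact` is True only if the whole result set
    was inspected; otherwise the counts are lower bounds.
    """
    counts = Counter()
    it = iter(results)
    for node in islice(it, limit):
        if can_read(node):  # hypothetical per-node access check
            counts[facet_of(node)] += 1
    # If anything is left after `limit` nodes, the counts are incomplete.
    exact = next(it, _END) is _END
    return counts, exact


def format_count(count, exact):
    """Render '12' for an exact count and '12+' for a lower bound."""
    return str(count) if exact else f"{count}+"
```

One consequence of this sketch: a facet whose only hits lie beyond the first `limit` results (the 10 videos behind 1 million images) does not show up at all.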
I don't think Amazon and Google have customers who can demand that they show correct facet counts... our customers typically do :). My take on this would be to have a configurable option between

1) exact and possibly slow counts
2) unauthorized, possibly incorrect, fast counts

Obviously, the second just uses the faceted navigation counts from the backing search implementation (without a node-by-node access manager check), whether that is the internal Lucene index, Solr or Elasticsearch.

If you opt for the second option then, depending on your authorization model, you can get fast, exact, authorized counts as well: namely when the authorization model can be translated into a search query / filter that is AND-ed with every normal search. For ES this is briefly described at [1]. Most likely the filter is cached internally, so even very large authorization queries (like the ones we have at Hippo because of our fine-grained ACL model) will perform well.

Obviously it depends quite heavily on your authorization model whether it can be translated into a query. If it relies on an external authorization check or has many hierarchical constraints, it will be very hard. If you base it on, say, node type, node name, node properties and jcr:path (a pseudo property), it can easily be translated into a query. Note that for a hierarchical jcr:path ACL (e.g. read everything below /foo) it is not easy to write a Lucene query unless you index path information as well. The consequence is that moving a large subtree is slow, because the entire subtree needs to be re-indexed.

A different authorization model might be based on groups, where every node is indexed with the tokens of the groups that can read it. Although I never looked much into the code, I suspect [2] does something like this.
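As a rough illustration of the "authorization as a filter" idea, here is a small in-memory Python sketch. It is not Lucene, Solr or Elasticsearch code; every name in it (`ancestors`, `index_doc`, `auth_filter`, `facet_counts`, the `read_groups` field) is hypothetical. It assumes each document is indexed with its ancestor paths and with the tokens of the groups allowed to read it, so the authorization check becomes a plain term match that is AND-ed with the user's query.

```python
from collections import Counter


def ancestors(path):
    """'/foo/a.png' -> ['/', '/foo', '/foo/a.png'] (indexed path information)."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def index_doc(path, mime, read_groups):
    """Build the indexed form of a node, including ACL-relevant fields."""
    return {
        "path": path,
        "mime": mime,
        "ancestors": set(ancestors(path)),  # must be rebuilt when a subtree moves
        "read_groups": set(read_groups),    # group tokens that may read the node
    }


def auth_filter(user_groups, readable_subtrees=()):
    """Translate a (simplified) authorization model into a document filter."""
    groups, subtrees = set(user_groups), set(readable_subtrees)

    def matches(doc):
        return bool(doc["read_groups"] & groups) or bool(doc["ancestors"] & subtrees)

    return matches


def facet_counts(index, query, auth, facet="mime"):
    """Count facet values over documents matching query AND authorization filter."""
    return Counter(doc[facet] for doc in index if query(doc) and auth(doc))
```

Because the ancestor paths are stored per document, moving a subtree means rebuilding that field for every descendant, which is the re-indexing cost mentioned above.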
So, instead of second-guessing what might be acceptable (slow queries, wrong counts, etc.) for which customers/users, I'd try to keep the options open: have a default of correct (slow) counts, and make it easy to flip to 'counts from the indexes without access manager authorization', where, depending on the authorization model, the latter can also return correct results.

For those who are interested, I will be listening to [3] this afternoon (5 pm GMT).

Regards Ard

[1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
[2] http://manifoldcf.apache.org/en_US/index.html
[3] http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/

>
> D.

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
