On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <[email protected]> wrote:
> On 09/12/2014 17:10, Michael Marth wrote:
>> ...
>>
>> The problematic use cases for counting facets that I have in mind are 
>> when a query returns millions of results. This is problematic when one wants 
>> to retrieve the exact size of the result set (taking ACLs into account, 
>> obviously). When facets are to be retrieved this will be an even harder 
>> problem (meaning when the exact number is to be calculated per facet).
>> As an illustration consider a digital asset management application that 
>> displays mime type as facets. A query could return 1 million images and, 
>> say, 10 videos.
>>
>> Is there a way we could support such scenarios (while still counting results 
>> per facet) and have a performant implementation?
>>
> We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
> we're done within it, then we can output the actual number. In case
> after 1000 nodes checked we still have some left we can leave the number
> either empty or with something like "many", "+", or any other fancy way
> if we want.
>
> In the end it is the same approach taken by Amazon (as Tommaso already
> pointed out) or, for example, Google. If you run a search, their facets
> (Searches related to...) never come with result counts.
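
As a rough illustration of the capped approach quoted above, here is a
minimal Python sketch. The names capped_facet_counts, can_read, facet_of
and label are hypothetical, not part of any existing API, and the limit
of 1000 is simply the figure from the suggestion.

```python
from collections import Counter


def capped_facet_counts(hits, can_read, facet_of, limit=1000):
    """Count facet values over at most `limit` ACL-checked hits.

    Returns (counts, exact). `exact` is False when hits were left
    unchecked, so a UI can render "many" or "+" instead of a number.
    """
    counts = Counter()
    checked = 0
    for hit in hits:
        if checked >= limit:
            # Hits remain beyond the cap: counts are only a lower bound.
            return counts, False
        checked += 1
        if can_read(hit):  # hypothetical per-node access check
            counts[facet_of(hit)] += 1
    return counts, True


def label(count, exact):
    """Render a count, marking capped values with a trailing '+'."""
    return str(count) if exact else f"{count}+"
```

With a cap like this, the cost of the access checks is bounded per
query, at the price of showing only a lower bound for large facets.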

I don't think Amazon and Google have customers who can demand that they
show correct facet counts... our customers typically do :). My take on
this would be to have a configurable option between

1) exact and possibly slow counts
2) unauthorized, possibly incorrect, fast counts

Obviously, the second just uses the faceted navigation counts from the
backing search implementation (without a node-by-node access manager
check), whether that is the internal Lucene index, Solr or
Elasticsearch. If you opt for the second option then, depending on your
authorization model, you can get fast, exact, authorized counts as
well: namely when the authorization model can be translated into a
search query / filter that is AND-ed with every normal search. For ES
this is briefly described at [1]. Most likely the filter is cached
internally, so even for very large authorization queries (like we have
at Hippo because of our fine-grained ACL model) it will just perform.

Whether the model can be translated to a query depends quite heavily on
the authorization model itself. If it relies on an external
authorization check or has many hierarchical constraints, it will be
very hard. If you base it on, say, node type, node name, node
properties and jcr:path (a pseudo property), it can easily be
translated to a query. Note that for a hierarchical jcr:path ACL (e.g.
read everything below /foo) it is not possible to write a Lucene query
easily unless you index path information as well. The consequence is
that moves of large subtrees are slow, because the entire subtree needs
to be re-indexed. A different authorization model might be based on
groups, where every node also gets indexed with the tokens of the
groups that can read it. Although I never looked much into the code, I
suspect [2] does something like this.
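
To make the group-based variant concrete, here is a minimal in-memory
sketch in Python. It is not Lucene, Solr or Elasticsearch code: the
INDEX entries, the read_groups field and the function names are all
hypothetical, and a real backend would apply (and probably cache) the
filter itself.

```python
from collections import Counter

# Hypothetical index entries: each document carries the tokens of the
# groups allowed to read it, indexed alongside its regular fields.
INDEX = [
    {"path": "/assets/a.png", "mime": "image/png", "read_groups": {"everyone"}},
    {"path": "/assets/b.png", "mime": "image/png", "read_groups": {"editors"}},
    {"path": "/assets/c.mp4", "mime": "video/mp4",
     "read_groups": {"editors", "admins"}},
]


def authorization_filter(user_groups):
    """Translate the user's groups into a predicate over index entries."""
    groups = frozenset(user_groups)
    return lambda doc: not groups.isdisjoint(doc["read_groups"])


def facet_counts(index, query, auth_filter, facet_field):
    """Count facet values for entries matching query AND auth_filter."""
    return Counter(
        doc[facet_field] for doc in index if query(doc) and auth_filter(doc)
    )


def all_assets(doc):
    """Stand-in for the user's normal search query."""
    return doc["path"].startswith("/assets/")
```

Because the filter only looks at indexed fields, the counts are already
authorized and no per-node access manager call is needed. The same idea
would apply to path-based ACLs if the ancestor paths were indexed, with
the re-indexing cost on moves mentioned above.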

So, instead of second-guessing what might be acceptable (slow
queries, wrong counts, etc.) for which customers/users, I'd try to keep
the options open: have a default of correct (slow) counts, and make it
easy to flip to 'counts from the indexes without access manager
authorization', where, depending on the authorization model, the latter
can still return correct results.
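
A sketch of what such a switch could look like, in Python and with
invented names (FacetCountMode, facet_counts, can_read). It is only an
illustration of the two modes, not a proposal for the actual
configuration API. In a real implementation the INDEX mode would read
the counts straight from the search backend instead of iterating over
hits.

```python
from collections import Counter
from enum import Enum


class FacetCountMode(Enum):
    EXACT = "exact"  # check every hit against the access manager (default)
    INDEX = "index"  # trust the index counts, no per-node check


def facet_counts(hits, facet_of, can_read, mode=FacetCountMode.EXACT):
    """Count facet values, applying the access check only in EXACT mode."""
    if mode is FacetCountMode.EXACT:
        hits = (h for h in hits if can_read(h))  # possibly slow
    return Counter(facet_of(h) for h in hits)
```

With EXACT as the default nobody gets wrong counts by accident, and a
deployment whose authorization is already expressed as an index filter
can switch to INDEX without losing correctness.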

For those who are interested, I will be listening to [3] this
afternoon (5 pm GMT).

Regards Ard

[1] 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
[2] http://manifoldcf.apache.org/en_US/index.html
[3] 
http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/


>
> D.
>
>
>
>



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
