2014-12-10 10:17 GMT+01:00 Ard Schrijvers <[email protected]>:

> On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <[email protected]>
> wrote:
> > On 09/12/2014 17:10, Michael Marth wrote:
> >> ...
> >>
> >> The problematic use cases for counting the facets I have in mind
> are when a query returns millions of results. This is problematic when one
> wants to retrieve the exact size of the result set (taking ACLs into
> account, obviously). When facets are to be retrieved this will be an even
> harder problem (meaning when the exact number is to be calculated per
> facet).
> >> As an illustration consider a digital asset management application that
> displays mime type as facets. A query could return 1 million images and,
> say, 10 videos.
> >>
> >> Is there a way we could support such scenarios (while still counting
> results per facet) and have a performant implementation?
> >>
> > We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
> > we're done within it, then we can output the actual number. In case
> > after 1000 nodes checked we still have some left we can leave the number
> > either empty or with something like "many", "+", or any other fancy way
> > if we want.
> >
> > In the end it is the same approach taken by Amazon (as Tommaso already
> > pointed out) or, for example, Google. If you run a search, their facets
> > (Searches related to...) never come with result counts.
>
> I don't think Amazon and Google have customers that can demand them to
> show correct facet counts...our customers typically do :).


I see, however something along the lines of what Davide was proposing
doesn't sound too bad to me even for such use cases (but I may be wrong).
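
To make the capped approach concrete, here is a minimal Python sketch (not
Oak code). It ACL-checks at most a fixed number of hits per facet and falls
back to a "many" label when hits are left over. The names capped_count,
format_count and can_read, and the idea that the index can hand back hits
per facet value, are assumptions made for illustration only.

```python
from typing import Callable, Iterable, Optional


def capped_count(hits: Iterable[str],
                 can_read: Callable[[str], bool],
                 max_checks: int = 1000) -> Optional[int]:
    """Count readable hits, ACL-checking at most max_checks of them.

    Returns the exact readable count if the hits run out within the
    budget, or None if at least one hit is left unchecked.
    """
    readable = 0
    checked = 0
    for path in hits:
        if checked == max_checks:
            # Budget used up and there is still at least one more hit.
            return None
        checked += 1
        if can_read(path):  # hypothetical per-node access check
            readable += 1
    return readable


def format_count(count: Optional[int]) -> str:
    """Render an exact number, or "many" when the cap was reached."""
    return "many" if count is None else str(count)
```

With Michael's example, a facet holding a million images would be labelled
"many" after 1000 checks, while the facet holding 10 videos would still get
an exact, authorized number.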


> My take on this would be to have a configurable option between
>
> 1) exact and possibly slow counts
> 2) unauthorized, possibly incorrect, fast counts
>
> Obviously, the second just uses the faceted navigation counts from the
> backing search implementation (without node by node access manager
> check), whether it is the internal Lucene index, Solr or
> Elasticsearch. If you opt for the second option, then, depending on your
> authorization model, you can get fast exact authorized counts as well:
> when the authorization model can be translated into a search query /
> filter that is AND-ed with every normal search. For ES this is briefly
> written at [1]. Most likely the filter is internally cached so even
> for very large authorization queries (like we have at Hippo because of
> fine grained ACL model) it will just perform. Obviously it depends
> quite heavily on your authorization model whether it can be translated
> to a query. If it relies on an external authorization check or has
> many hierarchical constraints, it will be very hard. If you choose to
> have it based on, say, nodetype, nodename, node properties and
> jcr:path (fake pseudo property) it can be easily translated to a
> query. Note that for the jcr:path hierarchical ACL (eg read everything
> below /foo) it is not possible to write a Lucene query easily unless
> you index path information as well, which in turn makes moves of
> large subtrees slow because the entire subtree needs to be
> re-indexed. A different authorization model might be based on groups,
> where every node also gets the groups (the token of the group) indexed
> that can read that node. Although I never looked much into the code, I
> suspect [2] does something like this.
>

That's what I had in mind in my proposal #4. The hurdles there relate to
the fact that each index implementation aiming at providing facets would
have to implement such an index and search with ACLs, which is not trivial.
One possibly good thing is that this is for sure not a new issue: as you
pointed out, Apache ManifoldCF has something like that for Solr (and I think
for ES too). On the other hand this would differ quite a bit from the
approach taken so far (indexes see just nodes and properties, the
QueryEngine post filters results on ACLs, node types, etc.), so that'd be a
significant change.
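
As a rough illustration of the group-based model Ard describes, the Python
sketch below uses a toy in-memory "index" in which every document carries
the tokens of the groups that may read it. The authorization check then
becomes a filter AND-ed with the normal query, so facet counts come out
authorized without a node-by-node check. IndexedDoc, read_groups and
authorized_facet_counts are hypothetical names; this is not how Oak,
ManifoldCF, Solr or ES implement it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import AbstractSet, Callable, FrozenSet, Iterable


@dataclass(frozen=True)
class IndexedDoc:
    path: str
    mime_type: str
    read_groups: FrozenSet[str]  # group tokens indexed alongside the node


def authorized_facet_counts(docs: Iterable[IndexedDoc],
                            matches: Callable[[IndexedDoc], bool],
                            user_groups: AbstractSet[str]) -> Counter:
    """Count hits per mime type for documents that match the query AND
    share at least one group token with the user."""
    counts: Counter = Counter()
    for doc in docs:
        authorized = not doc.read_groups.isdisjoint(user_groups)
        if matches(doc) and authorized:
            counts[doc.mime_type] += 1
    return counts
```

The cost moves to write time: whenever the set of groups allowed to read a
node changes, that node's indexed tokens have to be updated too.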


>
> So, instead of second guessing which might be acceptable (slow
> queries, wrong counts, etc) for which customers/users I'd try to keep
> the options open, have a default of correct (slow) counts, and make it
> easy to flip to 'counts from the indexes without accessmanager
> authorization', where depending on the authorization model, the latter
> can return correct results.
>

I think the best way of addressing this is to prototype (some of) the
mentioned options and see where we get. I'll see what I can do there.
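
As a starting point for such a prototype, the switch Ard suggests could look
roughly like the Python sketch below: the default is exact counts with a
per-node access check, and a flag flips to raw index counts. CountMode,
facet_counts and can_read are hypothetical names, and hits are modelled as
plain (path, facet value) pairs purely for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Iterable, Tuple


class CountMode(Enum):
    EXACT = "exact"  # node-by-node access check: correct, possibly slow
    INDEX = "index"  # counts straight from the index: fast, unauthorized


def facet_counts(hits: Iterable[Tuple[str, str]],
                 can_read: Callable[[str], bool],
                 mode: CountMode = CountMode.EXACT) -> Counter:
    """Count hits per facet value according to the configured mode."""
    if mode is CountMode.INDEX:
        # No access check: counts may include nodes the user cannot read.
        return Counter(value for _path, value in hits)
    # Default: only count nodes that pass the access check.
    return Counter(value for path, value in hits if can_read(path))
```

If the authorization model is already applied as an index-side filter, the
INDEX mode would return correct counts as well.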


>
> For those who are interested, I will be listening to [3] this
> afternoon (5 pm GMT).
>

cool, thanks for the pointer!

Regards,
Tommaso


>
> Regards Ard
>
> [1]
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
> [2] http://manifoldcf.apache.org/en_US/index.html
> [3]
> http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/
>
>
> >
> > D.
> >
> >
> >
> >
>
>
>
> --
> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> Boston - 1 Broadway, Cambridge, MA 02142
>
> US +1 877 414 4776 (toll free)
> Europe +31(0)20 522 4466
> www.onehippo.com
>
