Hello,

Sorry to chime in so late in this thread; I hope my remarks are still welcome. I did read the entire thread, and I won't reply inline, but will just try to recap and explain how we got around it in the Hippo repository. The problem is obvious:
*** How to efficiently get a correct count of total hits when fine-grained authorization is involved ***

Regarding the remark in this thread: 'For example, it doesn't make any sense to display "1045430 hits" if calculating this number takes 1.5 hours', I wholeheartedly agree, but our customers *never* agree. They want the exact hit count, no matter what.

So, we tackled this at Hippo as follows.

1) Next to getSize() on the iterator we also added getTotalSize(). I don't like the name, because it is actually more something like getTotalSizeWithoutCheckingACLs(). This method directly gives you back the number of hits from the backing search index. That one is fast, of course. What is slow is authorizing potentially 1.000.000 hits, because all those nodes need to be fetched from backing storage, etc. However, most of our customers have an application that shows the results for some siteuser below some folder, and the siteuser has read access to the entire folder. In that case we just show getTotalSizeWithoutCheckingACLs() as the total hits. Worst case, the number is higher than the actual number of hits the siteuser is allowed to read.

2) We have our ACLs based on node properties. Hence, we have been able to create an AuthorizationQuery, mapped directly to a cached Lucene bitset. When a JCR session searches in our repository, we combine its cached authorization bitset with the search. After changes in the repository we need to reload these bitsets (on request), but they are shared between all users that have the same authorization. It is blisteringly fast that way, and results in correct authorized counts.

Now, I don't think (2) can be part of Oak, as it implies a certain ACL model which is not generic enough. Quite some ACL mappings of course cannot be translated to a Lucene query. However, (1) should be very easy, and already is a lot better.
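
To make (1) and (2) a bit more concrete, here is a rough sketch of the idea, not the actual Hippo code. It assumes java.util.BitSet as a stand-in for the Lucene bitset, document numbers as stand-ins for index hits, and a hypothetical "authorization profile" string as the cache key. The class name, the method names and the authorizationQuery function are all made up for illustration.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: java.util.BitSet stands in for a Lucene bitset,
// and bit positions stand in for document numbers in the search index.
final class AuthorizedCountSketch {

    // One cached bitset per authorization profile, shared by all users
    // that have the same authorization.
    private final Map<String, BitSet> authorizationCache = new ConcurrentHashMap<>();

    // Assumed stand-in for running an authorization query against the index:
    // given a profile, return the set of documents that profile may read.
    private final Function<String, BitSet> authorizationQuery;

    AuthorizedCountSketch(Function<String, BitSet> authorizationQuery) {
        this.authorizationQuery = authorizationQuery;
    }

    // Idea (1): the raw hit count from the index, without any ACL check.
    static int totalSizeWithoutCheckingAcls(BitSet hits) {
        return hits.cardinality();
    }

    // Idea (2): intersect the hits with the cached authorization bitset,
    // so the count is correct without fetching a single node.
    int authorizedSize(String profile, BitSet hits) {
        BitSet readable = authorizationCache.computeIfAbsent(profile, authorizationQuery);
        BitSet authorized = (BitSet) hits.clone(); // do not modify the caller's hits
        authorized.and(readable);
        return authorized.cardinality();
    }

    // After changes in the repository: drop the cached bitsets so that they
    // are rebuilt on the next request.
    void invalidate() {
        authorizationCache.clear();
    }

    public static void main(String[] args) {
        // Assumption for the demo: the "siteuser" profile may read docs 0-3 and 5.
        BitSet readable = new BitSet();
        readable.set(0, 4);
        readable.set(5);
        AuthorizedCountSketch sketch =
                new AuthorizedCountSketch(profile -> (BitSet) readable.clone());

        // The search itself matched docs 1, 2, 4, 5 and 8.
        BitSet hits = new BitSet();
        for (int doc : new int[] {1, 2, 4, 5, 8}) {
            hits.set(doc);
        }

        System.out.println("total (no ACL check): " + totalSizeWithoutCheckingAcls(hits));
        System.out.println("authorized: " + sketch.authorizedSize("siteuser", hits));
    }
}
```

In this toy setup the unchecked total is 5 and the authorized count is 3, which is the "worst case, the number is higher" situation from (1). The intersection in (2) only works because the authorization can be expressed as a set of documents up front, which is the ACL-model restriction mentioned above.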
It is up to the developer then to use getTotalSizeWithoutCheckingACLs() (and then a decent name :-) or not

My 2 cents

Regards
Ard

On Tue, Sep 11, 2012 at 12:08 PM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> [moving this to oak-dev@ for a broader discussion]
>
> On Tue, Sep 11, 2012 at 9:55 AM, Thomas Mueller (JIRA) <[email protected]> wrote:
>> [...] For compatibility with Jackrabbit 2.0, and for ease of use, it would be good to
>> have a clearly defined way to get the size of the result. [...]
>
> I've always found the -1 return value from getSize() incredibly
> annoying as it forces client code to use extra conditionals and go
> through extra hoops if the size turns out not to be available. There
> are basically three potential scenarios:
>
> 1. The client doesn't need to know the size, so it never calls getSize().
> 2. The client does need to know the size, so it calls getSize() and
> has to iterate through all results if getSize() returns -1.
> 3. The client could use the size (for UI, optimization, etc.), so it
> calls getSize() and ignores the result if its -1.
>
> The main problem I have with the -1 return value is that case 2
> becomes really annoying to handle.
>
> Instead I'd propose the following design:
>
> * The getSize() method always returns the size, by buffering all
> results in memory if necessary.
> * A separate hasSize() method can be used to check if the size is
> quickly available (i.e. if getSize() will complete in O(1) time).
>
> With such a design the above cases become easier to handle:
>
> 1. The client doesn't need to know the size, so it never calls getSize().
> 2. The client does need to know the size, so it calls getSize().
> 3. The client could use the size (for UI, optimization, etc.), so it
> calls hasSize() and possibly follows up with getSize().
>
> PS. Note that implementing an "estimated size" feature like seen in
> many public search engines ("results 1-10 of thousands") is really
> difficult to implement in a manner that's both efficient and secure.
> Public search engines can make such estimates efficiently since all
> their content is public and they thus don't need to worry about
> accidentally leaking sensitive information.
>
> BR,
>
> Jukka Zitting

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
