[ https://issues.apache.org/jira/browse/OAK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813073#comment-16813073 ]
Vikas Saurabh commented on OAK-8167:
------------------------------------

[~anchela], let me try to describe what the feature currently does and one flow where _I_ feel there's information leakage.

The facet feature is essentially about counting how many items in the result of a query belong to a given category (usually termed a facet label), e.g. if I search for "hello" on some e-com site then the facets would show something like Books(200), Accessories(150), etc.

Lucene provides exact facet counts for a given query very quickly, but those counts don't take ACLs into account. The default "secure" facet config iterates over the whole result set and corrects what lucene returned according to how many items were accessible - this is painfully slow in any reasonably sized repository.

With statistical counting, we randomly pick 1000 (default but configurable) items from the result set and compute the ratio of how many of those 1000 items are accessible to the user. We multiply the facet counts given by lucene by that ratio and round down. E.g. result set size - 10000 with facets as l1 - 500, l2 - 499. Sampling 1000 items from it might show that 80% of the sampled items were accessible. Then we'd return the facets as l1 - 400 and l2 - 399 (due to rounding down).

About information leakage, consider the following scenario:

// Numbers as returned by lucene
RS - 10000
l1 - 200
l2 - 1100
// Accessible counts but unknown to us/code
RSA - 8000
l1A - 100
l2A - 1100

The sampled accessible ratio should be pretty close to 0.8 (8000/10000), so the counts that we'd return would be something like:

l1F - ~0.8 * 200 = ~160
l2F - ~0.8 * 1100 = ~880

Let's assume the user knows l2 is completely accessible.
So they can drill down into l2 (query with AND facet_prop=l2) to get:

RS_l2 - 1100
l2_l2 - 1100
// This is what the reality is for the drilled-down query
RSA_l2 - 1100
l2A_l2 - 1100

The sampled accessible ratio would be 1.0 as all items are accessible:

l2F_l2 - 1.0 * 1100 = 1100

With this information it can be deduced that l2 was multiplied by ~0.8 to get l2F. That, in turn, implies that l1 is l1F/0.8 = 200, which leaks the number of items labelled l1 even though the user doesn't have access to all of them.

There are at least 2 notable points:
* the drill-down count could also be retrieved by iterating through the drilled-down search result
* the deduction requires the user to know l2 - the farther the guess of l2 is from the actual l2, the farther off the deduction of l1 would be

I hope that this description clarifies what the feature is doing.

> With uneven distribution of ACL restriction across facet labels statistical
> facet count become too inaccurate
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-8167
>                 URL: https://issues.apache.org/jira/browse/OAK-8167
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene, query
>    Affects Versions: 1.6.16
>            Reporter: Kelvin Xu
>            Priority: Major
>              Labels: vulnerability
>
> With the statistical mode, facet count is updated proportionally to the
> percentage of accessible samples, which works for secured contents scattered
> across different facets. For edge case where the whole facet (results) is not
> accessible, the count still shows a number after the sampling percent is
> applied. Even if the number is small, user experience is
> misleading/inaccurate as nothing would return when the facet is clicked
> (applied as a query condition).
> For example, a ACLs/CUGs guarded "private" folder, in which all the assets
> are tagged with the same facet value. Non authorized user may still see this
> facet with a count but gets nothing when clicking on the facet.
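
For anyone who wants to replay the arithmetic from the comment and the issue description, here is a minimal Python sketch. It is not Oak's implementation: the function names, the boolean list standing in for ACL checks, and the extra "private" label (50 items, none accessible) are all made up for illustration. It only mirrors the described steps: sample, compute the accessible ratio, multiply, round down.

```python
import random


def sample_accessibility(accessible_flags, sample_size=1000, rng=None):
    """Return (accessible, sampled) for a random sample of the result set.

    accessible_flags is a hypothetical stand-in for per-item ACL checks:
    one boolean per result item.
    """
    rng = rng or random.Random()
    sampled = min(sample_size, len(accessible_flags))
    picked = rng.sample(accessible_flags, sampled)
    return sum(picked), sampled


def scale_facets(lucene_counts, accessible, sampled):
    """Multiply each raw count by accessible/sampled and round down."""
    if sampled == 0:
        return {label: 0 for label in lucene_counts}
    # Integer arithmetic keeps the round-down exact.
    return {label: count * accessible // sampled
            for label, count in lucene_counts.items()}


def deduce_raw_count(scaled_other, scaled_known, true_known):
    """Estimate another label's raw count from a label whose true count is known."""
    return round(scaled_other * true_known / scaled_known)


# Broad query: assume the sample of 1000 had 800 accessible items (ratio 0.8).
# "private" is an illustrative label where no item is accessible at all.
broad = scale_facets({"l1": 200, "l2": 1100, "private": 50}, 800, 1000)
# Drill-down into l2: everything is accessible, so the ratio is 1.0.
drilled = scale_facets({"l2": 1100}, 1000, 1000)
# Comparing the two l2 numbers recovers the raw l1 count.
l1_estimate = deduce_raw_count(broad["l1"], broad["l2"], drilled["l2"])
```

With these inputs `broad` carries 160 for l1, 880 for l2 and 40 for the fully inaccessible "private" label (the edge case from the issue description), and `l1_estimate` comes out at 200, the raw lucene count for l1. In a real run the sampled ratio is only close to 0.8, so the deduction is approximate.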
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)