[ 
https://issues.apache.org/jira/browse/OAK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813073#comment-16813073
 ] 

Vikas Saurabh commented on OAK-8167:
------------------------------------

[~anchela], let me try to describe what happens currently in the feature and 
one flow where _I_ feel there's information leakage.

The facet feature is essentially about counting how many items in the result of 
a query belong to a given category (usually termed a facet label). E.g. if I 
search for "hello" on some e-com site then the facets would show something like 
Books(200), Accessories(150), etc.
Lucene provides pretty quick real facet counts for a given query right away but 
those counts don't accommodate ACLs.
Default "secure" facet config iterates over the whole result set and corrects 
what lucene returned according to how many items were accessible - this is 
painfully bad from performance pov in any reasonable sized repository.
With statistical counting, we randomly pick 1000 items (the default, which is 
configurable) from the result set and compute the ratio of how many of those 
sampled items are accessible to the user. We multiply the facet counts given by 
lucene by that ratio and round down.

e.g. result set size - 10000 with facets as l1 - 500, l2 - 499. Sampling 1000 
items from it might show that 80% of the sampled items were accessible. Then 
we'd return the facets as l1 - 400 and l2 - 399 (due to rounding down)
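
The scaling step described above can be sketched roughly as follows. This is 
only an illustration of the arithmetic, not Oak's actual code: the function 
name, the is_accessible callback and the rng parameter are all made up for the 
example.

```python
import random


def statistical_facet_counts(lucene_counts, result_ids, is_accessible,
                             sample_size=1000, rng=None):
    """Scale raw facet counts by the accessible fraction of a random sample.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    rng = rng or random.Random()
    ids = list(result_ids)
    # Sample without replacement, capped at the size of the result set.
    sample = rng.sample(ids, min(sample_size, len(ids)))
    if not sample:
        return {label: 0 for label in lucene_counts}
    accessible = sum(1 for item in sample if is_accessible(item))
    # Integer arithmetic yields the exact floor of count * accessible / sampled.
    return {label: count * accessible // len(sample)
            for label, count in lucene_counts.items()}


# Stand-in ACL check: every fifth item is unreadable, so 80% are accessible.
can_read = lambda item_id: item_id % 5 != 0
# Sampling the whole result set here only to make the ratio exactly 0.8.
print(statistical_facet_counts({"l1": 500, "l2": 499}, range(10000),
                               can_read, sample_size=10000))
```

With the default sample of 1000 the ratio would only be close to 0.8, so the 
returned counts would vary slightly between runs.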

About information leakage, consider the following scenario:
// Numbers as returned by lucene
RS - 10000
l1 - 200
l2 - 1100

// Accessible counts but unknown to us/code
RSA - 8000
l1A - 100
l2A - 1100

sampled accessible ratio should be pretty close to 0.8 (8000/10000), so the 
counts that we'd return would be something like:
l1F - ~0.8 * 200 = ~160
l2F - ~0.8 * 1100 = ~880

Let's assume user knows l2 is completely accessible. So they can drill down 
into l2 (query with AND facet_prop=l2) to get:
RS_l2 - 1100
l2_l2 - 1100

// This is the reality for the drilled down query
RSA_l2 - 1100
l2A_l2 - 1100

sampled accessible ratio would be 1.0 as all items are accessible
l2F_l2 - 1.0 * 1100 = 1100

With this information it can be deduced that l2 was multiplied by ~0.8 to get 
l2F (~880/1100). The same ratio was applied to l1, which in turn implies that 
l1 is l1F/0.8 = ~200. That leaks the number of items labelled l1, even though 
the user does not have access to all of them.
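
The deduction can be written out as a small calculation. This is a sketch of 
the arithmetic only; the function and parameter names are hypothetical, and in 
practice the result is approximate because the sampled ratio is only close to 
0.8 and the counts are rounded down.

```python
def estimate_hidden_facet(l1_reported, l2_reported, l2_drilldown):
    """Undo the shared scaling factor using a facet whose true count is known.

    l1_reported, l2_reported: scaled counts from the broad query (l1F, l2F).
    l2_drilldown: count of l2 after drilling down, assumed to be the true l2.
    """
    if l2_reported <= 0:
        raise ValueError("l2_reported must be positive to infer a ratio")
    # ratio = l2_reported / l2_drilldown, and l1 is roughly l1_reported / ratio.
    return round(l1_reported * l2_drilldown / l2_reported)


# Numbers from the scenario above: l1F = 160, l2F = 880, drilled down l2 = 1100.
print(estimate_hidden_facet(160, 880, 1100))
```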

There are at least 2 notable points:
* drill down count could be retrieved by iterating through drilled down search 
result as well
* the deduction requires the user to know l2 - the farther the guess of l2 is 
from the actual l2, the less accurate the deduced l1 would be

I hope that this description clarifies what the feature is doing.

> With uneven distribution of ACL restriction across facet labels statistical 
> facet count become too inaccurate
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-8167
>                 URL: https://issues.apache.org/jira/browse/OAK-8167
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene, query
>    Affects Versions: 1.6.16
>            Reporter: Kelvin Xu
>            Priority: Major
>              Labels: vulnerability
>
> With the statistical mode, facet count is updated proportionally to the 
> percentage of accessible samples, which works for secured contents scattered 
> across different facets. For edge case where the whole facet (results) is not 
> accessible, the count still shows a number after the sampling percent is 
> applied. Even if the number is small, user experience is 
> misleading/inaccurate as nothing would return when the facet is clicked 
> (applied as a query condition).
> For example, take an ACL/CUG-guarded "private" folder in which all the assets 
> are tagged with the same facet value. A non-authorized user may still see 
> this facet with a count but gets nothing when clicking on the facet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
