Author: catholicon
Date: Tue Dec 18 01:02:02 2018
New Revision: 1849135

URL: http://svn.apache.org/viewvc?rev=1849135&view=rev
Log:
OAK-7939: Create/Update documentation regarding secure facet counting

Added:
    jackrabbit/oak/trunk/oak-doc/src/site/resources/img/facets-statistical-error-rate-plot.png   (with props)
Modified:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md?rev=1849135&r1=1849134&r2=1849135&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md Tue Dec 18 01:02:02 2018
@@ -1323,19 +1323,88 @@ Lucene property indexes can also be used
 
 Specific facet related features for Lucene property index can be configured in a separate _facets_ node below the
  index definition.
- By default ACL checks are always performed on facets by the Lucene property index however this can be avoided by setting
- the property _secure_ to _false_ in the _facets_ configuration node.
 `@since Oak 1.5.15` The no. of facets to be retrieved is configurable via the _topChildren_ property, which defaults to 10.
-
 ```
-/oak:index/lucene-with-unsecure-facets
+/oak:index/lucene-with-more-facets
   - jcr:primaryType = "oak:QueryIndexDefinition"
   - compatVersion = 2
   - type = "lucene"
   - async = "async"
   + facets
     - topChildren = 100
-    - secure = false
+  + indexRules
+    - jcr:primaryType = "nt:unstructured"
+    + nt:base
+      + properties
+        - jcr:primaryType = "nt:unstructured"
+        + tags
+          - facets = true
+          - propertyIndex = true
+```
+
+By default ACL checks are always performed on facets by the Lucene property index. However, a few configuration
+options control how ACL checks are done, via the _secure_ property in the _facets_ configuration node.
+`@since Oak 1.6.16, 1.8.10, 1.9.13` the `secure` property is a string with allowed values of `secure`, `statistical` and
+`insecure`, with `secure` being the default value. Before that, `secure` was a boolean property; to maintain compatibility,
+`false` maps to `insecure` while `true` (the default at the time) maps to `secure`.
+
+For `insecure` facets, the facet counts reported by the Lucene index are returned as is.
+For the `secure` configuration, all results of a query are checked for access permissions and the facets returned by the index are
+updated accordingly. This can be very expensive for large result sets.
+As a trade-off, the `statistical` configuration can be used to randomly sample some items (default `1000`, configurable via
+`sampleSize`) and check ACLs only for the random samples. Facet counts returned by the index are then scaled in proportion to the
+percentage of checked samples that were accessible.
+Do note that the [beauty of sampling](https://onlinecourses.science.psu.edu/stat100/node/16/) is that a sample size of
+`1000` would have a 3% error rate with 95% confidence. But that's a theoretical limit for an infinite number of experiments;
+in practice, a low rate of accessible documents decreases the chances of reaching that average rate. To give a sense of
+the error rate to expect, here's what errors looked like in different scenarios of test runs with a sample size of 1000,
+with the error averaged over 1000 random runs for each scenario.
+```
+|-----------------|-----------------------|------------------------|
+| Result set size | %age accessible nodes | Avg error in 1000 runs |
+|-----------------|-----------------------|------------------------|
+| 2000            |  5                    |  5.79                  |
+| 5000            |  5                    |  9.99                  |
+| 10000           |  5                    |  10.938                |
+| 100000          |  5                    |  11.13                 |
+|                 |                       |                        |
+| 2000            | 25                    | 2.4192004              |
+| 5000            | 25                    | 3.8087976              |
+| 10000           | 25                    | 4.096                  |
+| 100000          | 25                    | 4.3699985              |
+|                 |                       |                        |
+| 2000            | 50                    | 1.3990011              |
+| 5000            | 50                    | 2.2695997              |
+| 10000           | 50                    | 2.5303981              |
+| 100000          | 50                    | 2.594599               |
+|                 |                       |                        |
+| 2000            | 75                    | 0.80360085             |
+| 5000            | 75                    | 1.1929348              |
+| 10000           | 75                    | 1.4357346              |
+| 100000          | 75                    | 1.4272015              |
+|                 |                       |                        |
+| 2000            | 95                    | 0.30958                |
+| 5000            | 95                    | 0.52715933             |
+| 10000           | 95                    | 0.5109484              |
+| 100000          | 95                    | 0.5481065              |
+|-----------------|-----------------------|------------------------|
+```
+![error rate plot](../img/facets-statistical-error-rate-plot.png)
+
+Notice that the error rate does increase with larger result set sizes, but it flattens after around 10000 results. Also, note
+that even with 50% of results being accessible, the error rate averages less than 3%.
+
+So, in most cases, a sample size of 1000 should give a fairly decent estimate of facet counts. On the off chance that
+the setup is such that error rates are intolerable, the sample size can be configured with the _sampleSize_ property under the
+_facets_ configuration node. Error rates are generally inversely proportional to `√sample-size`. So, to halve the error
+rate, the sample size needs to be increased 4 times.
+
+A canonical example of a `statistical` configuration would look like:
+```
+/oak:index/lucene-with-statistical-facets
+  + facets
+    - secure = "statistical"
+    - sampleSize = 1500
   + indexRules
     - jcr:primaryType = "nt:unstructured"
     + nt:base
@@ -1929,4 +1998,4 @@ SELECT rep:facet(title) FROM [app:Asset]
 [jcr-contains]: http://www.day.com/specs/jcr/1.0/6.6.5.2_jcr_contains_Function.html
 [boost-faq]: https://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_make_sure_that_a_match_in_a_document_title_has_greater_weight_than_a_match_in_a_document_body.3F
 [score-explanation]: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/IndexSearcher.html#explain%28org.apache.lucene.search.Query,%20int%29
-[oak-lucene]: http://www.javadoc.io/doc/org.apache.jackrabbit/oak-lucene/
\ No newline at end of file
+[oak-lucene]: http://www.javadoc.io/doc/org.apache.jackrabbit/oak-lucene/

Added: jackrabbit/oak/trunk/oak-doc/src/site/resources/img/facets-statistical-error-rate-plot.png
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/resources/img/facets-statistical-error-rate-plot.png?rev=1849135&view=auto
==============================================================================
Binary file - no diff available.
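
The proportional adjustment that the lucene.md change describes for `statistical` facets can be sketched roughly as below. This is an illustrative Python sketch only, not Oak's implementation: `scale_facet_counts`, `is_accessible`, and the choice to sample uniformly without replacement are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, Optional, Sequence


def scale_facet_counts(
    index_counts: Dict[str, int],
    result_paths: Sequence[str],
    is_accessible: Callable[[str], bool],
    sample_size: int = 1000,
    rng: Optional[random.Random] = None,
) -> Dict[str, int]:
    """Estimate ACL-filtered facet counts from a random sample of results.

    index_counts  -- raw facet counts as reported by the index (hypothetical input)
    result_paths  -- paths of all query results
    is_accessible -- hypothetical ACL check for a single path
    """
    if not result_paths:
        return {label: 0 for label in index_counts}
    rng = rng or random.Random()
    # When the result set is smaller than the sample size, check every result.
    k = min(sample_size, len(result_paths))
    sample = rng.sample(list(result_paths), k)
    accessible = sum(1 for path in sample if is_accessible(path))
    ratio = accessible / k
    # Scale each raw count by the fraction of sampled results that were accessible.
    return {label: round(count * ratio) for label, count in index_counts.items()}


paths = [f"/content/doc{i}" for i in range(10_000)]
counts = {"news": 6000, "sports": 4000}


def every_fourth(path: str) -> bool:
    # Hypothetical ACL rule for the demo: every fourth document is readable.
    return int(path.rsplit("doc", 1)[1]) % 4 == 0


estimate = scale_facet_counts(counts, paths, every_fourth, rng=random.Random(42))
print(estimate)
```

With roughly 25% of documents accessible, the estimated counts should land near a quarter of the raw index counts, with the spread depending on the sample drawn.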

Propchange: jackrabbit/oak/trunk/oak-doc/src/site/resources/img/facets-statistical-error-rate-plot.png
------------------------------------------------------------------------------
    svn:mime-type = image/png
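
The "3% error rate with 95% confidence" figure and the `√sample-size` rule in the lucene.md change match the standard normal-approximation margin of error for a sampled proportion. The sketch below assumes that is the intended formula (the commit does not say so explicitly); `margin_of_error` is a hypothetical helper name, and `p = 0.5` is the worst case.

```python
import math


def margin_of_error(sample_size: int, p: float = 0.5, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a sampled proportion.

    z = 1.96 corresponds to 95% confidence; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1.0 - p) / sample_size)


for n in (1000, 4000):
    print(n, f"{margin_of_error(n) * 100:.2f}%")
# 1000 3.10%
# 4000 1.55%
```

Quadrupling the sample size from 1000 to 4000 halves the margin, which is the trade-off to weigh when raising `sampleSize`.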

