Author: catholicon
Date: Tue Dec 18 01:02:02 2018
New Revision: 1849135
URL: http://svn.apache.org/viewvc?rev=1849135&view=rev
Log:
OAK-7939: Create/Update documentation regarding secure facet counting
Added:
jackrabbit/oak/trunk/oak-doc/src/site/resources/img/facets-statistical-error-rate-plot.png
(with props)
Modified:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md
URL:
http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md?rev=1849135&r1=1849134&r2=1849135&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md Tue Dec 18
01:02:02 2018
@@ -1323,19 +1323,88 @@ Lucene property indexes can also be used
Specific facet related features for Lucene property index can be configured in
a separate _facets_ node below the
index definition.
- By default ACL checks are always performed on facets by the Lucene property
index however this can be avoided by setting
- the property _secure_ to _false_ in the _facets_ configuration node.
`@since Oak 1.5.15` The no. of facets to be retrieved is configurable via the
_topChildren_ property, which defaults to 10.
-
```
-/oak:index/lucene-with-unsecure-facets
+/oak:index/lucene-with-more-facets
- jcr:primaryType = "oak:QueryIndexDefinition"
- compatVersion = 2
- type = "lucene"
- async = "async"
+ facets
- topChildren = 100
- - secure = false
+ + indexRules
+ - jcr:primaryType = "nt:unstructured"
+ + nt:base
+ + properties
+ - jcr:primaryType = "nt:unstructured"
+ + tags
+ - facets = true
+ - propertyIndex = true
+```
+
+By default, ACL checks are always performed on facets by the Lucene property index. How these checks
+are done can be configured via the _secure_ property in the _facets_ configuration node.
+`@since Oak 1.6.16, 1.8.10, 1.9.13` The `secure` property is a string with allowed values of `secure`,
+`statistical` and `insecure`, with `secure` being the default. Before that, `secure` was a boolean
+property; to maintain compatibility, `false` maps to `insecure` while `true` (the default at the time)
+maps to `secure`.
+
+For `insecure` facets, the facet counts reported by the Lucene index are returned as is.
+For the `secure` configuration, all results of a query are checked for access permissions and the facets
+returned by the index are updated accordingly. This can be very expensive for large result sets.
+As a trade-off, the `statistical` configuration randomly samples some items (`1000` by default,
+configurable via `sampleSize`) and checks ACLs only for those samples. Facet counts returned by the index
+are then scaled in proportion to the percentage of sampled items that are accessible.
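The statistical mode described above can be sketched roughly as follows. This is a minimal illustration in Python, not Oak's implementation: `estimate_facet_counts`, `is_accessible` and the data shapes are hypothetical names chosen for the example, and rounding the scaled counts is an assumption.

```python
import random


def estimate_facet_counts(index_counts, results, is_accessible,
                          sample_size=1000, rng=None):
    """Scale index-reported facet counts by the sampled accessible ratio.

    index_counts:  dict of facet label -> count reported by the index
                   (no ACL checks applied yet).
    results:       list of result items of the query.
    is_accessible: hypothetical callback returning True when the current
                   session may read the given item.
    """
    rng = rng or random.Random()
    if not results:
        return {label: 0 for label in index_counts}

    # Check every item when the result set is small, otherwise a random sample.
    if len(results) <= sample_size:
        sample = results
    else:
        sample = rng.sample(results, sample_size)

    accessible = sum(1 for item in sample if is_accessible(item))
    ratio = accessible / len(sample)

    # Counts are updated proportionally to the accessible share of the sample.
    return {label: round(count * ratio) for label, count in index_counts.items()}
```

With a result set smaller than the sample size every item is checked, so the ratio is exact; for larger result sets the ratio is only an estimate, which is where the error rates discussed in this section come from.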
+Do note that the [beauty of sampling](https://onlinecourses.science.psu.edu/stat100/node/16/) is that a
+sample size of `1000` would have a 3% error rate with 95% confidence. But that is a theoretical limit for
+an infinite number of experiments. In practice, a low rate of accessible documents decreases the chance
+of reaching that average rate. To give a sense of the error rate to expect, here is what the errors looked
+like in different test scenarios with a sample size of 1000, with the error averaged over 1000 random runs
+for each scenario.
+```
+|-----------------|--------------------|------------------------------|
+| Result set size | % accessible nodes | Avg error (%) over 1000 runs |
+|-----------------|--------------------|------------------------------|
+| 2000            | 5                  | 5.79                         |
+| 5000            | 5                  | 9.99                         |
+| 10000           | 5                  | 10.938                       |
+| 100000          | 5                  | 11.13                        |
+|                 |                    |                              |
+| 2000            | 25                 | 2.4192004                    |
+| 5000            | 25                 | 3.8087976                    |
+| 10000           | 25                 | 4.096                        |
+| 100000          | 25                 | 4.3699985                    |
+|                 |                    |                              |
+| 2000            | 50                 | 1.3990011                    |
+| 5000            | 50                 | 2.2695997                    |
+| 10000           | 50                 | 2.5303981                    |
+| 100000          | 50                 | 2.594599                     |
+|                 |                    |                              |
+| 2000            | 75                 | 0.80360085                   |
+| 5000            | 75                 | 1.1929348                    |
+| 10000           | 75                 | 1.4357346                    |
+| 100000          | 75                 | 1.4272015                    |
+|                 |                    |                              |
+| 2000            | 95                 | 0.30958                      |
+| 5000            | 95                 | 0.52715933                   |
+| 10000           | 95                 | 0.5109484                    |
+| 100000          | 95                 | 0.5481065                    |
+|-----------------|--------------------|------------------------------|
+```
+
+
+Notice that the error rate does increase with larger result set sizes, but it flattens out after around
+10000 results. Also note that even with 50% of the results being accessible, the error rate averages less
+than 3%.
+
+So, in most cases, a sample size of 1000 should give a fairly decent estimate of facet counts. On the off
+chance that the setup is such that error rates are intolerable, the sample size can be configured with the
+_sampleSize_ property under the _facets_ configuration node. Error rates are generally inversely
+proportional to `√sample-size`. So, to halve the error rate, the sample size needs to be increased 4 times.
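The inverse square-root relationship can be checked with the standard normal-approximation margin of error for a sampled proportion. The sketch below assumes the worst case `p = 0.5` and `z = 1.96` for 95% confidence; the function names are illustrative and are not part of Oak. It gives the theoretical margin on the sampled proportion, which is not the same measurement as the empirical averages in the table above.

```python
import math


def margin_of_error(sample_size, p=0.5, z=1.96):
    """Normal-approximation margin of error for a proportion p.

    z = 1.96 corresponds to 95% confidence; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / sample_size)


def sample_size_for(target_error, p=0.5, z=1.96):
    """Smallest sample size whose margin of error is at most target_error."""
    return math.ceil((z / target_error) ** 2 * p * (1 - p))


if __name__ == "__main__":
    for n in (1000, 1500, 4000):
        print(n, f"{margin_of_error(n):.4f}")
```

Under these assumptions a sample of 1000 gives a margin of roughly 3%, and quadrupling the sample to 4000 halves it, which matches the rule of thumb stated in the text.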
+
+A canonical example of a `statistical` configuration would look like:
+```
+/oak:index/lucene-with-statistical-facets
+ + facets
+ - secure = "statistical"
+ - sampleSize = 1500
+ indexRules
- jcr:primaryType = "nt:unstructured"
+ nt:base
@@ -1929,4 +1998,4 @@ SELECT rep:facet(title) FROM [app:Asset]
[jcr-contains]:
http://www.day.com/specs/jcr/1.0/6.6.5.2_jcr_contains_Function.html
[boost-faq]:
https://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_make_sure_that_a_match_in_a_document_title_has_greater_weight_than_a_match_in_a_document_body.3F
[score-explanation]:
https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/IndexSearcher.html#explain%28org.apache.lucene.search.Query,%20int%29
-[oak-lucene]: http://www.javadoc.io/doc/org.apache.jackrabbit/oak-lucene/
\ No newline at end of file
+[oak-lucene]: http://www.javadoc.io/doc/org.apache.jackrabbit/oak-lucene/
Added:
jackrabbit/oak/trunk/oak-doc/src/site/resources/img/facets-statistical-error-rate-plot.png
URL:
http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/resources/img/facets-statistical-error-rate-plot.png?rev=1849135&view=auto
==============================================================================
Binary file - no diff available.
Propchange:
jackrabbit/oak/trunk/oak-doc/src/site/resources/img/facets-statistical-error-rate-plot.png
------------------------------------------------------------------------------
svn:mime-type = image/png