[
https://issues.apache.org/jira/browse/OAK-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562678#comment-17562678
]
Thomas Mueller commented on OAK-9811:
-------------------------------------
{noformat}
/oak:index/statistics/index
/oak:index/statistics/index/jcr:createdBy { count: 10240, uniqueHLL: 454543 }
/oak:index/statistics/index/jcr:primaryType { ... }
/oak:index/statistics/index/hidden
{noformat}
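As a rough illustration of what a field like uniqueHLL in the layout above could hold, here is a minimal HyperLogLog sketch. It assumes that uniqueHLL stands for a HyperLogLog estimate of the number of distinct values, which the layout does not state explicitly. The class name, the register count (2^10) and the use of SHA-256 as the hash are hypothetical choices for this example, not Oak or tinyStats code.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog distinct-value estimator (illustrative only)."""

    def __init__(self, p=10):
        self.p = p                    # number of index bits (assumed value)
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The top p bits select the register.
        idx = h >> (64 - self.p)
        # The remaining bits determine the rank: the position of the
        # leftmost 1-bit, counted from 1.
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        # Bias correction constant, valid for m >= 128.
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # For small cardinalities, fall back to linear counting.
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))
```

With 1024 registers the expected relative error is around 3%, and adding a value that was already seen never changes a register, so repeated values cause no additional writes.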
> Statistics index
> ----------------
>
> Key: OAK-9811
> URL: https://issues.apache.org/jira/browse/OAK-9811
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: indexing, query
> Reporter: Thomas Mueller
> Assignee: Thomas Mueller
> Priority: Major
>
> Queries should be as fast as possible:
> * They should read as little data as possible (low I/O)
> * Network roundtrips should be reduced (see also OAK-9780)
> * In-memory processing should be fast (low CPU usage)
> To do that:
> * Queries need to _have_ the right indexes. Possibly indexes need to be
> added (which might be a manual task, or semi-automated, or fully automated).
> For a developer, it would also be good to know how fast a query could be if
> an index were added.
> * Queries should _use_ the right indexes. Sometimes multiple indexes can be
> used.
> * Queries should use the right execution plan (for example: a join can be
> evaluated in multiple ways).
> For this, it helps to have accurate statistics. We currently have
> statistics about the number of nodes per path ([approximate
> counter|https://github.com/apache/jackrabbit-oak/tree/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/counter]),
> and document statistics for Lucene and Elastic indexes.
> But we currently don't have statistics for _unindexed_ data. That would be
> good to have: how common is each property (by property name)? How many
> distinct values are there per property? What does the histogram of values
> look like? And so on.
> For this, something like the counter index could be added, updated
> using a streaming algorithm. We need to ensure that the number of writes to
> this index is low (e.g. less than 1% of the overall writes) and that memory
> usage is very low. There are a number of such libraries, but arguably we could
> implement this ourselves, as our use case is atypical (reduced number of
> writes, reduced memory usage). https://github.com/thomasmueller/tinyStats and
> related libraries could be used as a starting point.
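To make the "low number of writes" requirement concrete, here is a minimal sketch of a sampled counter: each event updates the stored value only with probability 1/resolution, and each update adds resolution, so the stored count stays an unbiased estimate. This is not the existing counter index implementation; the class name, the resolution of 200 and the event count are hypothetical values chosen for the example.

```python
import random


class SampledCounter:
    """Approximate counter that writes only on a sampled subset of events."""

    def __init__(self, resolution=200, rng=None):
        self.resolution = resolution      # assumed value; 1 write per ~200 events
        self.rng = rng or random.Random()
        self.count = 0                    # stored (approximate) count
        self.writes = 0                   # how often the stored value changed

    def record(self):
        # With probability 1/resolution, add `resolution` to the stored count.
        # The expected increase per event is therefore exactly 1.
        if self.rng.randrange(self.resolution) == 0:
            self.count += self.resolution
            self.writes += 1


counter = SampledCounter(resolution=200, rng=random.Random(42))
events = 1_000_000
for _ in range(events):
    counter.record()
print(counter.count, counter.writes, counter.writes / events)
```

A higher resolution lowers the write rate further but increases the variance of the estimate, so the value would need to be tuned against the accuracy the query planner actually needs.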
--
This message was sent by Atlassian Jira
(v8.20.10#820010)