[ 
https://issues.apache.org/jira/browse/OAK-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562678#comment-17562678
 ] 

Thomas Mueller commented on OAK-9811:
-------------------------------------

{noformat}
/oak:index/statistics/index
/oak:index/statistics/index/jcr:createdBy { count: 10240, uniqueHLL: 454543 }
/oak:index/statistics/index/jcr:primaryType { ... }
/oak:index/statistics/index/hidden
{noformat}


> Statistics index
> ----------------
>
>                 Key: OAK-9811
>                 URL: https://issues.apache.org/jira/browse/OAK-9811
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: indexing, query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>            Priority: Major
>
> Queries should be as fast as possible:
> * They should read as little data as possible (low I/O)
> * Network roundtrips should be reduced (see also OAK-9780)
> * In-memory processing should be fast (low CPU usage)
> To do that:
> * Queries need to _have_ the right indexes. Indexes may need to be 
> added (which might be a manual task, or semi-automated, or fully automated). 
> For a developer, it would also be good to know how fast a query could be if 
> an index were added.
> * Queries should _use_ the right indexes. Sometimes multiple indexes can be 
> used.
> * Queries should use the right execution plan (for example: a join can be 
> evaluated in multiple ways).
> For this, it is great to have accurate statistics. We currently have 
> statistics about number of nodes per path ([approximate 
> counter|https://github.com/apache/jackrabbit-oak/tree/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/counter]),
>  and document statistics for Lucene and Elastic indexes. 
> But we currently don't have statistics for _unindexed_ data. That would be 
> good to have: how common is each property (by property name)? How many 
> distinct values are there per property? What is the histogram? And so on. 
> For this, something like the counter index could be added that is updated 
> using a streaming algorithm. We need to ensure that the number of writes to 
> this index is low (e.g. less than 1% of the overall writes) and that memory 
> usage is very low. There are a number of such libraries, but arguably we 
> could implement this ourselves, as our use case is atypical (reduced number 
> of writes, reduced memory usage). https://github.com/thomasmueller/tinyStats 
> and related libraries could be used as a starting point.
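
The `uniqueHLL` field in the comment above suggests a HyperLogLog-style estimate of distinct values per property. As a rough illustration of such a streaming algorithm, here is a minimal sketch of a textbook HyperLogLog with small-range correction. It is not code from Oak or tinyStats. The class name `DistinctCountSketch`, the register count (4096) and the hash function (FNV-1a plus a murmur3-style finalizer) are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative HyperLogLog sketch (hypothetical; not Oak or tinyStats code). */
final class DistinctCountSketch {
    private static final int P = 12;       // number of index bits (assumed value)
    private static final int M = 1 << P;   // 4096 one-byte registers, about 4 KB
    private final byte[] registers = new byte[M];

    /** 64-bit FNV-1a over the UTF-8 bytes, then a murmur3-style bit mixer. */
    static long hash(String value) {
        long h = 0xcbf29ce484222325L;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /**
     * Adds a value. Returns true only if a register changed, so a caller
     * could skip persisting the sketch when nothing changed.
     */
    boolean add(String value) {
        long h = hash(value);
        int index = (int) (h >>> (64 - P));   // top P bits select the register
        long rest = h << P;                   // remaining bits determine the rank
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
            return true;
        }
        return false;
    }

    /** Estimated number of distinct values added so far. */
    long estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double e = alpha * M * M / sum;
        if (e <= 2.5 * M && zeros > 0) {
            // small-range correction: fall back to linear counting
            e = M * Math.log((double) M / zeros);
        }
        return Math.round(e);
    }

    public static void main(String[] args) {
        DistinctCountSketch sketch = new DistinctCountSketch();
        for (int i = 0; i < 100_000; i++) {
            sketch.add("value-" + i);
        }
        System.out.println("estimated distinct values: " + sketch.estimate());
    }
}
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Re-adding a value that was already seen never changes a register, which is one way the number of writes might be kept down. Whether that is enough to stay below the 1% target mentioned in the description is not something this sketch demonstrates.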



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
