[jira] [Commented] (OAK-9811) Statistics index

Thomas Mueller (Jira) Thu, 14 Jul 2022 03:16:07 -0700


    [ 
https://issues.apache.org/jira/browse/OAK-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566748#comment-17566748
 ]


Thomas Mueller commented on OAK-9811:
-------------------------------------

[~nitigupt] had a very nice idea: instead of updating the statistics online 
(via the index editor), we could build them with the oak-run "index" command. 
There, we iterate over all the nodes anyway. In this mode, we could collect 
higher-resolution data and keep all of it in memory. I think we should 
consider both the online update approach and this bulk / 
[ETL|https://en.wikipedia.org/wiki/Extract,_transform,_load] approach.
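
To illustrate the bulk mode, here is a minimal sketch of an in-memory collector 
that could be fed while iterating over all nodes. The class and method names 
(BulkPropertyStats, add, count, distinctCount) are hypothetical and not part of 
Oak or oak-run; it assumes property values are available as strings and that 
exact sets of distinct values fit in memory.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical collector, not an Oak API: exact per-property statistics
// gathered fully in memory during a single pass over all nodes.
public class BulkPropertyStats {

    // how often each property name was seen
    private final Map<String, Long> counts = new HashMap<>();
    // exact set of distinct values per property name (assumes it fits in memory)
    private final Map<String, Set<String>> distinctValues = new HashMap<>();

    // Record one property occurrence found while traversing the repository.
    public void add(String propertyName, String value) {
        counts.merge(propertyName, 1L, Long::sum);
        distinctValues.computeIfAbsent(propertyName, k -> new HashSet<>()).add(value);
    }

    // Number of occurrences of the property, or 0 if never seen.
    public long count(String propertyName) {
        return counts.getOrDefault(propertyName, 0L);
    }

    // Number of distinct values of the property, or 0 if never seen.
    public int distinctCount(String propertyName) {
        Set<String> values = distinctValues.get(propertyName);
        return values == null ? 0 : values.size();
    }

    public static void main(String[] args) {
        BulkPropertyStats stats = new BulkPropertyStats();
        stats.add("jcr:primaryType", "nt:unstructured");
        stats.add("jcr:primaryType", "nt:file");
        stats.add("jcr:primaryType", "nt:file");
        System.out.println(stats.count("jcr:primaryType")
                + " occurrences, " + stats.distinctCount("jcr:primaryType")
                + " distinct values");
    }
}
```

For very large repositories the exact sets would probably have to be replaced 
by approximate structures, but the single-pass shape stays the same.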

> Statistics index
> ----------------
>
>                 Key: OAK-9811
>                 URL: https://issues.apache.org/jira/browse/OAK-9811
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: indexing, query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>            Priority: Major
>
> Queries should be as fast as possible:
> * They should read as little data as possible (low I/O)
> * Network roundtrips should be reduced (see also OAK-9780)
> * In-memory processing should be fast (low CPU usage)
> To do that:
> * Queries need to _have_ the right indexes. Possibly indexes need to be 
> added (which might be a manual task, or semi-automated, or fully automated). 
> For a developer, it would also be good to know how fast a query could be, if 
> an index is added.
> * Queries should _use_ the right indexes. Sometimes multiple indexes can be 
> used.
> * Queries should use the right execution plan (for example: a join can be 
> evaluated in multiple ways).
> For this, it is great to have accurate statistics. We currently have 
> statistics about number of nodes per path ([approximate 
> counter|https://github.com/apache/jackrabbit-oak/tree/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/counter]),
>  and document statistics for Lucene and Elastic indexes. 
> But we currently don't have statistics for _unindexed_ data. That would be 
> good to have: how common is each property (by property name)? How many 
> distinct values are there per property? What is the histogram? And so on. 
> For this, something like the counter index could be added, that is updated 
> using a streaming algorithm. We need to ensure the number of writes to this 
> index is low (e.g. less than 1% of the overall writes), and memory usage is 
> very low. There are a number of such libraries, but arguably we could 
> implement this ourselves, as our use case is untypical (reduced number of 
> writes, reduced memory usage). https://github.com/thomasmueller/tinyStats and 
> related libraries could be used as a starting point.
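
As an illustration of the streaming approach described in the issue, the sketch 
below estimates the number of distinct values of one property using linear 
counting over a fixed-size bitmap. This is one possible technique chosen for 
brevity; it is not the tinyStats API and not something the issue prescribes. 
The class name ApproxDistinctCounter and the bitmap size are assumptions.

```java
// Hypothetical sketch, not an Oak or tinyStats API: linear counting.
// Memory use is fixed (bitCount bits), and adding a value never allocates.
public class ApproxDistinctCounter {

    private final long[] bits;
    private final int m; // total number of bits in the bitmap

    public ApproxDistinctCounter(int bitCount) {
        this.bits = new long[(bitCount + 63) / 64]; // round up to whole words
        this.m = bits.length * 64;
    }

    // 64-bit mixing function (MurmurHash3 finalizer) to spread hash codes.
    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    // Set the bit selected by the hash of the value; duplicates hit the same bit.
    public void add(String value) {
        int index = (int) Long.remainderUnsigned(mix(value.hashCode()), m);
        bits[index >>> 6] |= 1L << (index & 63);
    }

    // Linear counting estimate: n is approximately -m * ln(zeroBits / m).
    public long estimate() {
        int set = 0;
        for (long word : bits) {
            set += Long.bitCount(word);
        }
        // A full bitmap is saturated; clamp so the logarithm stays finite.
        int zero = Math.max(1, m - set);
        return Math.round(-m * Math.log((double) zero / m));
    }

    public static void main(String[] args) {
        ApproxDistinctCounter counter = new ApproxDistinctCounter(16 * 1024);
        for (int i = 0; i < 1000; i++) {
            counter.add("v" + i);
        }
        System.out.println("estimated distinct values: " + counter.estimate());
    }
}
```

To keep the number of writes low, such a bitmap would only need to be persisted 
when a previously unset bit is set, which becomes rarer as it fills up.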



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
