[
https://issues.apache.org/jira/browse/OAK-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566748#comment-17566748
]
Thomas Mueller commented on OAK-9811:
-------------------------------------
[~nitigupt] had a very nice idea: Instead of updating the statistics online (via
the index editor), we could build the statistics via the oak-run "index" command.
There, we iterate over all the nodes anyway. In this mode, we could collect
higher resolution data, and keep all data fully in memory. I think we should
consider both the online update approach, as well as this bulk /
[ETL|https://en.wikipedia.org/wiki/Extract,_transform,_load] approach.
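
To make the bulk approach a bit more concrete, here is a minimal sketch of what collecting statistics fully in memory during one pass over all nodes could look like. It is only an illustration: the class name BulkPropertyStats and the representation of a node as a plain name-to-value map are assumptions, and it does not use any oak-run or Oak API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Oak: exact per-property statistics
// collected in memory while iterating over all nodes once.
public class BulkPropertyStats {
    // property name -> number of nodes that carry this property
    private final Map<String, Long> counts = new HashMap<>();
    // property name -> distinct values seen so far (exact, kept in memory)
    private final Map<String, Set<String>> distinct = new HashMap<>();

    // Record one node, given here as a simple name -> value map (an assumption).
    public void add(Map<String, String> properties) {
        for (Map.Entry<String, String> e : properties.entrySet()) {
            counts.merge(e.getKey(), 1L, Long::sum);
            distinct.computeIfAbsent(e.getKey(), k -> new HashSet<>()).add(e.getValue());
        }
    }

    // How many nodes have this property.
    public long count(String name) {
        return counts.getOrDefault(name, 0L);
    }

    // How many distinct values this property has.
    public int distinctValues(String name) {
        Set<String> values = distinct.get(name);
        return values == null ? 0 : values.size();
    }
}
```

Keeping exact value sets is only feasible because this runs offline with everything in memory; for very large repositories the sets would likely need to be replaced by approximate sketches.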
> Statistics index
> ----------------
>
> Key: OAK-9811
> URL: https://issues.apache.org/jira/browse/OAK-9811
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: indexing, query
> Reporter: Thomas Mueller
> Assignee: Thomas Mueller
> Priority: Major
>
> Queries should be as fast as possible:
> * They should read as little data as possible (low I/O)
> * Network roundtrips should be reduced (see also OAK-9780)
> * In-memory processing should be fast (low CPU usage)
> To do that:
> * Queries need to _have_ the right indexes. Possibly indexes need to be
> added (which might be a manual task, or semi-automated, or fully automated).
> For a developer, it would also be good to know how fast a query could be if
> an index were added.
> * Queries should _use_ the right indexes. Sometimes multiple indexes can be
> used.
> * Queries should use the right execution plan (for example: a join can be
> evaluated in multiple ways).
> For this, it is great to have accurate statistics. We currently have
> statistics about number of nodes per path ([approximate
> counter|https://github.com/apache/jackrabbit-oak/tree/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/counter]),
> and document statistics for Lucene and Elastic indexes.
> But we don't have statistics for _unindexed_ data currently. That would be
> good to have: which property (by property name) is how common? How many
> distinct values are there per property? What is the histogram? And so on.
> For this, something like the counter index could be added, which is updated
> using a streaming algorithm. We need to ensure the number of writes to this
> index is low (e.g. less than 1% of the overall writes), and memory usage is
> very low. There are a number of such libraries, but arguably we could
> implement this ourselves, as our use case is atypical (reduced number of
> writes, reduced memory usage). https://github.com/thomasmueller/tinyStats and
> related libraries could be used as a starting point.
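
One way to keep the number of writes low for the online approach is probabilistic (sampled) counting: each event updates the stored value only with probability 1/resolution, and then adds resolution at once, so the stored value stays an unbiased estimate. The sketch below is an assumption for illustration only; the class SampledCounter, its fields, and the resolution of 100 are hypothetical and are not taken from Oak or tinyStats.

```java
import java.util.Random;

// Hypothetical sketch: a counter that is written only about once per
// "resolution" events, while its expected value equals the true count.
public class SampledCounter {
    private final int resolution;
    private final Random random;
    private long persisted; // stands in for the value stored in the index
    private long writes;    // number of (simulated) writes to the index

    public SampledCounter(int resolution, long seed) {
        this.resolution = resolution;
        this.random = new Random(seed);
    }

    // Called once per event; writes with probability 1 / resolution.
    public void increment() {
        if (random.nextInt(resolution) == 0) {
            persisted += resolution;
            writes++;
        }
    }

    // Estimated number of events seen so far.
    public long estimate() {
        return persisted;
    }

    // Number of writes that were needed.
    public long writes() {
        return writes;
    }
}
```

With a resolution of 100, roughly 1% of the events cause a write. The price is lower accuracy for small counts, which is usually acceptable for cost estimation.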