[jira] [Updated] (OAK-9811) Statistics index

Thomas Mueller (Jira) Thu, 20 Oct 2022 06:37:06 -0700


     [ 
https://issues.apache.org/jira/browse/OAK-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Mueller updated OAK-9811:
--------------------------------
    Description: 
Queries should be as fast as possible:
* They should read as little data as possible (low I/O)
* Network roundtrips should be reduced (see also OAK-9780)
* In-memory processing should be fast (low CPU usage)

To do that:

* Queries need to _have_ the right indexes. Indexes may need to be added 
(manually, semi-automatically, or fully automatically). For a 
developer, it would also be good to know how fast a query could be if an index 
were added.
* Queries should _use_ the right indexes. Sometimes multiple indexes can be 
used.
* Queries should use the right execution plan (for example: a join can be 
evaluated in multiple ways).

For this, it helps to have accurate statistics. We currently have statistics 
about the number of nodes per path ([approximate 
counter|https://github.com/apache/jackrabbit-oak/tree/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/counter]),
 and document statistics for Lucene and Elastic indexes. 

But we currently don't have statistics for _unindexed_ data. These would be good 
to have: How common is each property (by property name)? How many distinct 
values are there per property? What is the histogram? And so on. For this, 
something like the counter index could be added, updated using a 
streaming algorithm. We need to ensure that the number of writes to this index is 
low (e.g. less than 1% of the overall writes), and that memory usage is very low. 
There are a number of such libraries, but arguably we could implement this 
ourselves, as our use case is atypical (reduced number of writes, reduced 
memory usage). https://github.com/thomasmueller/tinyStats and related libraries 
could be used as a starting point.
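
To illustrate how the number of writes could be kept low, here is a minimal sketch of a sampled counter per property name. It is only an assumption about how such an index might work, not the proposed implementation: the class name SampledPropertyCounter, its methods, and the resolution parameter are hypothetical. Each observation is recorded with probability 1/resolution, and a recorded observation adds "resolution" to the estimate, so the estimate is unbiased while only about one in "resolution" observations causes a write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * Hypothetical sketch, not part of Oak: approximate per-property counts
 * with a sampled update, so that only a small fraction of observations
 * results in a write.
 */
class SampledPropertyCounter {

    private final int resolution;
    private final Random random;
    // One entry per property name that was sampled at least once.
    // Memory grows with the number of distinct sampled names.
    private final Map<String, Long> estimates = new HashMap<>();
    private long writes;

    SampledPropertyCounter(int resolution, long seed) {
        if (resolution < 1) {
            throw new IllegalArgumentException("resolution must be >= 1");
        }
        this.resolution = resolution;
        this.random = new Random(seed);
    }

    /** Called once for each property that is added to the repository. */
    void propertySeen(String name) {
        // With probability 1 / resolution, add "resolution" to the estimate.
        // The expected value of the estimate equals the true count.
        if (random.nextInt(resolution) == 0) {
            estimates.merge(name, (long) resolution, Long::sum);
            writes++;
        }
    }

    /** Estimated number of occurrences; 0 if the name was never sampled. */
    long estimate(String name) {
        return estimates.getOrDefault(name, 0L);
    }

    /** Number of updates that were actually written. */
    long writes() {
        return writes;
    }
}
```

With a resolution of 200, roughly 0.5% of the observations lead to a write. The trade-off is accuracy for rare properties: a property that occurs only a few times will usually have an estimate of 0.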

Additional use cases:
* Property names that are only used once or twice are likely typos
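
The typo use case could look like the following sketch. It assumes that per-property counts are available as a map from property name to count; the class name RarePropertyNames, the method likelyTypos, and the threshold parameter are hypothetical. Note that this needs counts that are reliable for small values, which a purely sampled counter does not provide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of Oak: report rarely used property names. */
class RarePropertyNames {

    /**
     * Returns the property names that occur at most maxCount times,
     * sorted alphabetically. These are candidates for typos.
     */
    static List<String> likelyTypos(Map<String, Long> counts, long maxCount) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() <= maxCount) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

For example, with the counts jcr:title=5000, jcr:tilte=1, sling:resourceType=3000 and sling:resorceType=2, a threshold of 2 reports jcr:tilte and sling:resorceType.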

  was:
Queries should be as fast as possible:
* They should read as little data as possible (low I/O)
* Network roundtrips should be reduced (see also OAK-9780)
* In-memory processing should be fast (low CPU usage)

To do that:

* Queries need to _have_ the right indexes. Indexes may need to be added 
(manually, semi-automatically, or fully automatically). For a 
developer, it would also be good to know how fast a query could be if an index 
were added.
* Queries should _use_ the right indexes. Sometimes multiple indexes can be 
used.
* Queries should use the right execution plan (for example: a join can be 
evaluated in multiple ways).

For this, it helps to have accurate statistics. We currently have statistics 
about the number of nodes per path ([approximate 
counter|https://github.com/apache/jackrabbit-oak/tree/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/counter]),
 and document statistics for Lucene and Elastic indexes. 

But we currently don't have statistics for _unindexed_ data. These would be good 
to have: How common is each property (by property name)? How many distinct 
values are there per property? What is the histogram? And so on. For this, 
something like the counter index could be added, updated using a 
streaming algorithm. We need to ensure that the number of writes to this index is 
low (e.g. less than 1% of the overall writes), and that memory usage is very low. 
There are a number of such libraries, but arguably we could implement this 
ourselves, as our use case is atypical (reduced number of writes, reduced 
memory usage). https://github.com/thomasmueller/tinyStats and related libraries 
could be used as a starting point.


> Statistics index
> ----------------
>
>                 Key: OAK-9811
>                 URL: https://issues.apache.org/jira/browse/OAK-9811
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: indexing, query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>            Priority: Major
>
> Queries should be as fast as possible:
> * They should read as little data as possible (low I/O)
> * Network roundtrips should be reduced (see also OAK-9780)
> * In-memory processing should be fast (low CPU usage)
> To do that:
> * Queries need to _have_ the right indexes. Indexes may need to be 
> added (manually, semi-automatically, or fully automatically). 
> For a developer, it would also be good to know how fast a query could be if 
> an index were added.
> * Queries should _use_ the right indexes. Sometimes multiple indexes can be 
> used.
> * Queries should use the right execution plan (for example: a join can be 
> evaluated in multiple ways).
> For this, it helps to have accurate statistics. We currently have 
> statistics about the number of nodes per path ([approximate 
> counter|https://github.com/apache/jackrabbit-oak/tree/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/counter]),
>  and document statistics for Lucene and Elastic indexes. 
> But we currently don't have statistics for _unindexed_ data. These would be 
> good to have: How common is each property (by property name)? How many 
> distinct values are there per property? What is the histogram? And so on. 
> For this, something like the counter index could be added, updated 
> using a streaming algorithm. We need to ensure that the number of writes to this 
> index is low (e.g. less than 1% of the overall writes), and that memory usage is 
> very low. There are a number of such libraries, but arguably we could 
> implement this ourselves, as our use case is atypical (reduced number of 
> writes, reduced memory usage). https://github.com/thomasmueller/tinyStats and 
> related libraries could be used as a starting point.
> Additional use cases:
> * Property names that are only used once or twice are likely typos



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
