[
https://issues.apache.org/jira/browse/OAK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vikas Saurabh updated OAK-6735:
-------------------------------
Attachment: IndexReadPattern.txt
LuceneIndexReadPattern.java
So, I was trying to see how much reading Lucene incurs while calculating
various stats. I used [^LuceneIndexReadPattern.java] (it has a few hard-coded
paths for indexed data on my setup).
Following are the sizes of the indices I extracted the stats from:
{noformat}
$ du -sh */data
364K damAssetLucene-1505227087108/data
36M lucene-1505227210399/data
4.2G PetabyteDamAssetLucene/data
19G PetabyteLucene/data
46M someLuceneIdx/data
{noformat}
The complete output is at [^IndexReadPattern.txt].
A few interesting things to note:
* opening a reader reads quite a bit - but we open a reader only on index refresh
(and we've been incurring this cost all along anyway)
* reading numDocs and reading numTermsPerField didn't incur any reads, even on
/oak:index/lucene that AEM provisions (index size at 19G)
* reading numDocsAgainstATerm does require reads (although only in large indices)
So, I think we'd need to limit ourselves to termsPerField if we tie stats
collection to index refresh.
If we want some deeper stats collection, then it'd have to happen infrequently
in some background thread.
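A rough sketch of that split, assuming cheap per-field term counts are captured on index refresh while per-term document counts are filled in by an infrequent background task. The class {{IndexStatsSketch}}, its methods and the thread name are hypothetical and not part of Oak; the actual Lucene reads are left to the caller-supplied maps and supplier.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical holder for index stats; not an Oak class.
public class IndexStatsSketch {
    // Cheap stats: replaced wholesale on every index refresh.
    private volatile Map<String, Long> termsPerField = Map.of();
    // Expensive stats: only touched by the infrequent background task.
    private final Map<String, Long> docsPerTerm = new ConcurrentHashMap<>();

    /** Called from an (assumed) index refresh hook with stats that needed no extra reads. */
    public void onRefresh(Map<String, Long> freshTermsPerField) {
        termsPerField = Map.copyOf(freshTermsPerField);
    }

    /** Runs one round of the expensive collection and merges the result. */
    public void runDeepCollection(Supplier<Map<String, Long>> deepCollector) {
        docsPerTerm.putAll(deepCollector.get());
    }

    /** Schedules the expensive collection on a single daemon thread. */
    public ScheduledExecutorService scheduleDeepStats(
            Supplier<Map<String, Long>> deepCollector, long period, TimeUnit unit) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "index-deep-stats");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleWithFixedDelay(() -> {
            try {
                runDeepCollection(deepCollector);
            } catch (RuntimeException e) {
                // Swallow so that one failed round does not cancel later rounds.
            }
        }, period, period, unit);
        return executor;
    }

    /** Returns the term count for a field, or -1 if unknown. */
    public long termsFor(String field) {
        return termsPerField.getOrDefault(field, -1L);
    }

    /** Returns the document count for a term, or -1 if not collected yet. */
    public long docsFor(String term) {
        return docsPerTerm.getOrDefault(term, -1L);
    }
}
```

Callers would see -1 for any per-term count that the background task has not produced yet, so a cost estimate could fall back to the cheaper stats in that case.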
> Lucene Index: improved cost estimation by using document count per field
> ------------------------------------------------------------------------
>
> Key: OAK-6735
> URL: https://issues.apache.org/jira/browse/OAK-6735
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene, query
> Affects Versions: 1.7.4
> Reporter: Thomas Mueller
> Fix For: 1.8
>
> Attachments: IndexReadPattern.txt, LuceneIndexReadPattern.java
>
>
> The cost estimation of the Lucene index is somewhat inaccurate because (by
> default) it just uses the number of documents in the index (the default as of
> Oak 1.7.4, due to OAK-6333).
> Instead, it should use the number of documents for the given fields (the
> minimum, if there are multiple fields with restrictions).
> That number should then be divided by the number of restrictions (as we
> already do now).
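For illustration, the estimate described in the issue could be sketched as below: take the minimum per-field document count over the restricted fields and divide it by the number of restrictions. This is only a sketch under assumptions: the class and method names are hypothetical, the per-field counts are passed in as a plain map, and falling back to the total document count for fields without stats is a choice made here, not something the issue specifies.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper; not the Oak implementation.
public class LuceneCostSketch {

    /**
     * Estimates the cost as min(docCount per restricted field) / number of restrictions.
     * With no restrictions, the total number of documents is returned.
     */
    public static double estimateCost(long numDocs,
                                      Map<String, Long> docCountPerField,
                                      List<String> restrictedFields) {
        if (restrictedFields.isEmpty()) {
            return numDocs;
        }
        long min = numDocs;
        for (String field : restrictedFields) {
            // Assumption: fields without stats fall back to the total doc count.
            long count = docCountPerField.getOrDefault(field, numDocs);
            min = Math.min(min, count);
        }
        return (double) min / restrictedFields.size();
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of("jcr:title", 1000L, "dc:format", 200L);
        // min(1000, 200) = 200, divided by 2 restrictions
        System.out.println(estimateCost(10_000, counts, List.of("jcr:title", "dc:format"))); // prints 100.0
    }
}
```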
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)