[jira] [Updated] (OAK-6735) Lucene Index: improved cost estimation by using document count per field

Vikas Saurabh (JIRA) Wed, 11 Oct 2017 11:05:16 -0700

     [ 
https://issues.apache.org/jira/browse/OAK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vikas Saurabh updated OAK-6735:
-------------------------------
    Attachment: IndexReadPattern.txt
                LuceneIndexReadPattern.java

So, I was trying to see how much reading Lucene would incur while calculating 
various stats. I used [^LuceneIndexReadPattern.java] (it has a few hard-coded 
paths for indexed data on my setup).

These are the sizes of the indices I extracted the stats from:
{noformat}
$ du -sh */data
364K    damAssetLucene-1505227087108/data
36M     lucene-1505227210399/data
4.2G    PetabyteDamAssetLucene/data
19G     PetabyteLucene/data
46M     someLuceneIdx/data
{noformat}

The complete output is at [^IndexReadPattern.txt].

A few interesting things to note:
* opening the reader reads quite a bit, but we open a reader only on index 
refresh (and we've been incurring this cost all along anyway)
* reading numDocs and numTermsPerField didn't incur any reads, even on the 
/oak:index/lucene index that AEM provisions (index size 19G)
* reading numDocsAgainstATerm does require reads (although only in large indices)

So, I think we'd need to limit ourselves to termsPerField if we tie stats 
collection to index refresh.

If we want some deeper stats collection, then it'd have to happen infrequently 
in some background thread.
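
A minimal sketch of that split, using only the JDK: cheap stats are replaced 
synchronously on index refresh, deeper stats only by an infrequent background 
task. The class and method names (IndexStatsCache, onIndexRefresh, 
collectDeepStats, scheduleDeepStats) are hypothetical and not part of Oak; the 
suppliers stand in for whatever actually reads the Lucene index.

```java
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical holder for index statistics; an illustration, not Oak code. */
final class IndexStatsCache {
    // Cheap stats (e.g. terms per field): replaced whenever the reader is refreshed.
    private volatile Map<String, Long> termsPerField = Map.of();
    // Deep stats (e.g. docs per term): replaced only by the background task.
    private volatile Map<String, Long> docsPerTerm = Map.of();

    /** Called on the refresh path, so the collector must be cheap. */
    void onIndexRefresh(Supplier<Map<String, Long>> cheapCollector) {
        termsPerField = Map.copyOf(cheapCollector.get());
    }

    /** Runs one deep collection; may be expensive, so keep it off the refresh path. */
    void collectDeepStats(Supplier<Map<String, Long>> deepCollector) {
        docsPerTerm = Map.copyOf(deepCollector.get());
    }

    /** Schedules deep collection at a fixed delay on a caller-supplied executor. */
    ScheduledFuture<?> scheduleDeepStats(ScheduledExecutorService executor,
                                         Supplier<Map<String, Long>> deepCollector,
                                         long period, TimeUnit unit) {
        return executor.scheduleWithFixedDelay(
                () -> collectDeepStats(deepCollector), period, period, unit);
    }

    Map<String, Long> termsPerField() { return termsPerField; }

    Map<String, Long> docsPerTerm() { return docsPerTerm; }
}
```

Readers always see an immutable snapshot, so the query path never blocks on 
collection. Note that scheduleWithFixedDelay stops rescheduling if the task 
throws, so a real collector would need its own error handling.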

> Lucene Index: improved cost estimation by using document count per field
> ------------------------------------------------------------------------
>
>                 Key: OAK-6735
>                 URL: https://issues.apache.org/jira/browse/OAK-6735
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, query
>    Affects Versions: 1.7.4
>            Reporter: Thomas Mueller
>             Fix For: 1.8
>
>         Attachments: IndexReadPattern.txt, LuceneIndexReadPattern.java
>
>
> The cost estimation of the Lucene index is somewhat inaccurate because (by 
> default, as of Oak 1.7.4, due to OAK-6333) it just uses the number of 
> documents in the index.
> Instead, it should use the number of documents for the given fields (the 
> minimum, if there are multiple fields with restrictions), 
> divided by the number of restrictions (as we already do now).
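
The estimate described in the issue can be sketched roughly as follows. This is 
only an illustration of the arithmetic, not the Oak implementation: the class 
FieldCostEstimator and its parameters are hypothetical, and falling back to the 
total document count for fields without a known count is an assumption.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating the proposed estimate; not part of Oak. */
final class FieldCostEstimator {
    private FieldCostEstimator() {}

    /**
     * @param numDocs          total number of documents in the index
     * @param docCountPerField number of documents that contain each field
     * @param restrictedFields fields on which the query has a restriction
     * @return minimum per-field document count, divided by the number of restrictions
     */
    static double estimate(long numDocs,
                           Map<String, Long> docCountPerField,
                           List<String> restrictedFields) {
        if (restrictedFields.isEmpty()) {
            // No field restrictions: nothing better than the whole index.
            return numDocs;
        }
        long min = numDocs;
        for (String field : restrictedFields) {
            Long count = docCountPerField.get(field);
            // Assumption: a field with no known count falls back to numDocs.
            if (count != null) {
                min = Math.min(min, count);
            }
        }
        return (double) min / restrictedFields.size();
    }
}
```

For example, with 1000 documents in the index, 100 documents having field "a" 
and 40 having field "b", a query restricting both fields would be estimated at 
40 / 2 = 20 instead of 1000 / 2 = 500.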



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
