[jira] [Comment Edited] (OAK-6735) Lucene Index: improved cost estimation by using document count per field

Vikas Saurabh (JIRA) Tue, 24 Oct 2017 02:12:09 -0700

    [ https://issues.apache.org/jira/browse/OAK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200696#comment-16200696 ]


Vikas Saurabh edited comment on OAK-6735 at 10/24/17 9:10 AM:
--------------------------------------------------------------

So, I was trying to see how many reads Lucene would incur while calculating 
various stats. I used [^LuceneIndexReadPattern.java] (it has a few hard-coded 
paths for indexed data on my setup).

Following are the sizes of the indices I extracted the stats from:
{noformat}
$ du -sh */data
364K    damAssetLucene-1505227087108/data
36M     lucene-1505227210399/data
4.2G    PetabyteDamAssetLucene/data
19G     PetabyteLucene/data
46M     someLuceneIdx/data
{noformat}

The complete output is at [^IndexReadPattern.txt].

A few interesting things to note:
* opening the reader reads quite a bit, but we open a reader only on index 
refresh (and we've been incurring that cost already anyway)
* reading numDocs and reading -numTermsPerField- numDocsPerField didn't incur 
any reads, even on the /oak:index/lucene that AEM provisions (index size 19G)
* reading numDocsAgainstATerm does require reads (although in large indices)

So, I think we'd need to limit ourselves to termsPerField if we tie stats 
collection to index refresh.

If we want deeper stats collection, it would have to happen infrequently 
in a background thread.
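
To make the split concrete, here is a rough sketch of how the two kinds of stats could be kept apart. It is only an illustration under assumptions: IndexStatsHolder, onIndexRefresh, collectDeepStats and scheduleDeepStats are hypothetical names, not existing Oak or Lucene APIs, and the maps stand in for whatever the real reader would report.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical holder: cheap stats are swapped in on index refresh,
// expensive stats only by an infrequent background task.
final class IndexStatsHolder {
    // Cheap stats (e.g. docs per field): replaced wholesale on every refresh.
    private volatile Map<String, Long> docsPerField = Collections.emptyMap();
    // Expensive stats (e.g. docs per term): replaced only by the background collector.
    private volatile Map<String, Long> docsPerTerm = Collections.emptyMap();

    // Called from the refresh path; must stay cheap.
    void onIndexRefresh(Map<String, Long> freshDocsPerField) {
        docsPerField = Collections.unmodifiableMap(new HashMap<>(freshDocsPerField));
    }

    // Runs the (possibly read-heavy) collector and publishes its result.
    void collectDeepStats(Supplier<Map<String, Long>> expensiveCollector) {
        docsPerTerm = Collections.unmodifiableMap(new HashMap<>(expensiveCollector.get()));
    }

    // Schedules the deep collection on a daemon thread with a fixed delay.
    ScheduledExecutorService scheduleDeepStats(
            Supplier<Map<String, Long>> expensiveCollector, long period, TimeUnit unit) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "deep-index-stats");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleWithFixedDelay(
                () -> collectDeepStats(expensiveCollector), period, period, unit);
        return executor;
    }

    // Both lookups return -1 when no value has been collected yet.
    long docsForField(String field) {
        return docsPerField.getOrDefault(field, -1L);
    }

    long docsForTerm(String term) {
        return docsPerTerm.getOrDefault(term, -1L);
    }
}
```

The point of the split is that the refresh path never waits on the expensive collector; queries simply see the last published snapshot of each map.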


> Lucene Index: improved cost estimation by using document count per field
> ------------------------------------------------------------------------
>
>                 Key: OAK-6735
>                 URL: https://issues.apache.org/jira/browse/OAK-6735
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, query
>    Affects Versions: 1.7.4
>            Reporter: Thomas Mueller
>            Assignee: Vikas Saurabh
>             Fix For: 1.8
>
>         Attachments: IndexReadPattern.txt, LuceneIndexReadPattern.java
>
>
> The cost estimation of the Lucene index is somewhat inaccurate because, by 
> default, it just uses the number of documents in the index (the default as 
> of Oak 1.7.4, due to OAK-6333).
> Instead, it should use the number of documents for the given fields (the 
> minimum, if there are multiple fields with restrictions), 
> divided by the number of restrictions (as we already do now).
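
The estimate described above could be sketched roughly as follows. This is not the Oak implementation: FieldCostEstimator and its parameters are hypothetical names, the per-field counts are passed in as a plain map, and falling back to the total document count for a field with no recorded count is an assumption of this sketch.

```java
import java.util.Collection;
import java.util.Map;

// Hypothetical helper illustrating the proposed estimate.
final class FieldCostEstimator {
    private FieldCostEstimator() {
    }

    /**
     * @param numDocs          total number of documents in the index
     * @param docCountPerField number of documents that have each field
     * @param restrictedFields fields the query has restrictions on
     */
    static long estimate(long numDocs,
                         Map<String, Long> docCountPerField,
                         Collection<String> restrictedFields) {
        if (restrictedFields.isEmpty()) {
            // No field restrictions: only the overall document count is available.
            return numDocs;
        }
        long min = numDocs;
        for (String field : restrictedFields) {
            Long count = docCountPerField.get(field);
            // Assumption: a field without a recorded count does not lower the estimate.
            if (count != null) {
                min = Math.min(min, count);
            }
        }
        // Minimum over the restricted fields, divided by the number of restrictions
        // (integer division).
        return min / restrictedFields.size();
    }
}
```

For example, with 5000 documents in total, 1000 having one restricted field and 200 having another, the estimate is min(1000, 200) / 2 = 100.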



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
