[jira] [Commented] (OAK-3092) Cache recently extracted text to avoid duplicate extraction

Alex Parvulescu (JIRA) Mon, 09 Nov 2015 05:11:23 -0800

    [ 
https://issues.apache.org/jira/browse/OAK-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996501#comment-14996501
 ]


Alex Parvulescu commented on OAK-3092:
--------------------------------------

looks good! +1

How complicated would it be to purge old/unneeded entries from the cache 
when the referencing content is removed? Bookkeeping for the binary ids 
would be a pain and I'm not sure the gains are worth it. How much volatile 
binary content would end up in this cache anyway?

> Cache recently extracted text to avoid duplicate extraction
> -----------------------------------------------------------
>
>                 Key: OAK-3092
>                 URL: https://issues.apache.org/jira/browse/OAK-3092
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.11
>
>         Attachments: OAK-3092-v1.patch, OAK-3092-v2.patch
>
>
> Text can be extracted from the same binary multiple times in a given 
> indexing cycle. This can happen for two reasons:
> # Multiple Lucene indexes indexing the same node - A system might have 
> multiple Lucene indexes, e.g. a global Lucene index and an index for a 
> specific nodeType. In a given indexing cycle the same file would be picked 
> up by both index definitions and both would extract the same text
> # Aggregation - With index time aggregation the same file gets picked up 
> multiple times due to aggregation rules
> To avoid the wasted effort of duplicate text extraction from the same file 
> in a given indexing cycle, it would be better to have an expiring cache 
> which can hold on to extracted text content for some time. The cache should 
> have the following features:
> # Limit on total size
> # Way to expire the content using [Timed 
> Eviction|https://code.google.com/p/guava-libraries/wiki/CachesExplained#Timed_Eviction]
>  - As the chances of the same file getting picked up again are high only 
> within a given indexing cycle, it would be better to expire the cache 
> entries after some time to avoid hogging memory unnecessarily
> Such a cache would provide the following benefits:
> # Avoid duplicate text extraction - Text extraction is costly and has to be 
> minimized on the critical path of {{indexEditor}}
> # Avoid expensive IO, especially if binary content has to be fetched from a 
> remote {{BlobStore}}
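
For illustration, the two cache features listed in the description (a limit on total size and time-based expiry) could be sketched roughly as follows. This is a minimal JDK-only sketch, not the code in the attached patches and not the Guava cache the description links to. The class name, the character-count size limit, the `blobId` key and the injected clock are all assumptions made for the example.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a size-limited, expiring cache for extracted text. */
class ExtractedTextCacheSketch {

    /** Cached text plus the time it was written, used for expiry. */
    private static final class CachedText {
        final String text;
        final long writtenAtMillis;

        CachedText(String text, long writtenAtMillis) {
            this.text = text;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final long maxChars;      // assumed size limit: total cached characters
    private final long ttlMillis;     // entries older than this count as expired
    private final LongSupplier clock; // injected so expiry can be tested

    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, CachedText> entries =
            new LinkedHashMap<>(16, 0.75f, true);
    private long currentChars;

    ExtractedTextCacheSketch(long maxChars, long ttlMillis, LongSupplier clock) {
        this.maxChars = maxChars;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /**
     * Returns the cached text for blobId, or runs the extractor and caches
     * its result. The extractor runs under the lock, which keeps the sketch
     * simple but serializes extraction.
     */
    synchronized String get(String blobId, Function<String, String> extractor) {
        long now = clock.getAsLong();
        CachedText cached = entries.get(blobId);
        if (cached != null) {
            if (now - cached.writtenAtMillis < ttlMillis) {
                return cached.text; // fresh hit: no extraction, no blob IO
            }
            // Expired: drop the stale entry before extracting again.
            entries.remove(blobId);
            currentChars -= cached.text.length();
        }
        String text = extractor.apply(blobId);
        if (text.length() <= maxChars) { // never cache text larger than the limit
            entries.put(blobId, new CachedText(text, now));
            currentChars += text.length();
            evictUntilWithinLimit();
        }
        return text;
    }

    /** Removes least recently used entries until the size limit holds. */
    private void evictUntilWithinLimit() {
        Iterator<Map.Entry<String, CachedText>> it = entries.entrySet().iterator();
        while (currentChars > maxChars && it.hasNext()) {
            currentChars -= it.next().getValue().text.length();
            it.remove();
        }
    }

    public static void main(String[] args) {
        int[] extractions = {0};
        ExtractedTextCacheSketch cache =
                new ExtractedTextCacheSketch(1_000_000, 60_000, System::currentTimeMillis);
        Function<String, String> extractor = id -> {
            extractions[0]++; // stands in for the costly text extraction
            return "text of " + id;
        };
        // Two index definitions asking for the same binary in one cycle.
        cache.get("blob-1", extractor);
        cache.get("blob-1", extractor);
        System.out.println("extractions: " + extractions[0]);
    }
}
```

The sketch keys entries by a binary identifier and does not track which nodes reference a binary, so entries for removed content simply age out through the time-based expiry.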



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
