[ https://issues.apache.org/jira/browse/OAK-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18067253#comment-18067253 ]
Julian Reschke commented on OAK-3092:
-------------------------------------
trunk: (1.3.11)
[8781d8c0f1|https://github.com/apache/jackrabbit-oak/commit/8781d8c0f1b661534a6dc13e38f8061a0a11e856]
1.22: (1.3.11)
[8781d8c0f1|https://github.com/apache/jackrabbit-oak/commit/8781d8c0f1b661534a6dc13e38f8061a0a11e856]
> Cache recently extracted text to avoid duplicate extraction
> -----------------------------------------------------------
>
> Key: OAK-3092
> URL: https://issues.apache.org/jira/browse/OAK-3092
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Affects Versions: 1.0.24, 1.2.8, 1.3.11
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Priority: Major
> Labels: performance
> Fix For: 1.0.24, 1.2.8, 1.3.11, 1.4
>
> Attachments: OAK-3092-v1.patch, OAK-3092-v2.patch
>
>
> Text can be extracted from the same binary multiple times in a
> given indexing cycle. This can happen for two reasons:
> # Multiple Lucene indexes indexing the same node - A system might have multiple
> Lucene indexes, e.g. a global Lucene index and an index for a specific nodeType.
> In a given indexing cycle the same file would be picked up by both index
> definitions, and both would extract the same text
> # Aggregation - With index-time aggregation the same file gets picked up multiple
> times due to aggregation rules
> To avoid the wasted effort of duplicate text extraction from the same file in a
> given indexing cycle, it would be better to have an expiring cache that
> holds on to extracted text content for some time. The cache should have the
> following features:
> # Limit on total size
> # Way to expire the content using [Timed
> Eviction|https://code.google.com/p/guava-libraries/wiki/CachesExplained#Timed_Eviction]
> - As the chances of the same file getting picked up are high only within a given
> indexing cycle, it would be better to expire the cache entries after some time
> to avoid hogging memory unnecessarily
> Such a cache would provide the following benefits:
> # Avoid duplicate text extraction - Text extraction is costly and has to be
> minimized on the critical path of {{indexEditor}}
> # Avoid expensive IO, especially if binary content has to be fetched from a
> remote {{BlobStore}}
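
The two cache features listed in the description (a limit on total size and timed expiry) can be illustrated with a minimal sketch. This is not the code from the attached patches or from Oak itself: the class name ExtractedTextCache, the blobId key, the character-count size measure, and the injected clock are all assumptions made for illustration. The description points at Guava's timed eviction; the sketch below uses only the JDK as a stand-in so that it is self-contained.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a size-limited, expiring cache for extracted text. */
final class ExtractedTextCache {

    /** Cached text plus the time it was written. */
    private static final class CachedText {
        final String text;
        final long writtenAtMillis;

        CachedText(String text, long writtenAtMillis) {
            this.text = text;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final long maxTotalChars;   // limit on total cached text size
    private final long ttlMillis;       // entries expire this long after being written
    private final LongSupplier clock;   // injected so that expiry can be tested
    // Insertion-ordered: the oldest entry is always first.
    private final LinkedHashMap<String, CachedText> entries = new LinkedHashMap<>();
    private long totalChars;

    ExtractedTextCache(long maxTotalChars, long ttlMillis, LongSupplier clock) {
        this.maxTotalChars = maxTotalChars;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached text for blobId, or runs the extractor once and caches its result. */
    synchronized String get(String blobId, Function<String, String> extractor) {
        long now = clock.getAsLong();
        evictExpired(now);
        CachedText cached = entries.get(blobId);
        if (cached != null) {
            return cached.text;         // duplicate extraction avoided
        }
        String text = extractor.apply(blobId);
        put(blobId, text, now);
        return text;
    }

    synchronized int size() {
        return entries.size();
    }

    private void put(String blobId, String text, long now) {
        if (text.length() > maxTotalChars) {
            return;                     // too large to cache at all
        }
        entries.put(blobId, new CachedText(text, now));
        totalChars += text.length();
        // Drop the oldest entries until the total size is back under the limit.
        Iterator<Map.Entry<String, CachedText>> it = entries.entrySet().iterator();
        while (totalChars > maxTotalChars && it.hasNext()) {
            totalChars -= it.next().getValue().text.length();
            it.remove();
        }
    }

    private void evictExpired(long now) {
        Iterator<Map.Entry<String, CachedText>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            CachedText oldest = it.next().getValue();
            if (now - oldest.writtenAtMillis < ttlMillis) {
                break;                  // all remaining entries were written later
            }
            totalChars -= oldest.text.length();
            it.remove();
        }
    }
}
```

With such a cache shared between index editors, a second index definition or an aggregation rule asking for the same binary within the expiry window would call get with the same key and receive the cached text without running the extractor again. With Guava, the same behaviour would presumably be configured through CacheBuilder with a maximum weight, a weigher, and expireAfterWrite or expireAfterAccess; whether the actual change does so is not stated in this message.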