[ https://issues.apache.org/jira/browse/OAK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282896#comment-15282896 ]
Tommaso Teofili commented on OAK-4368:
--------------------------------------
I've experimented with {{PostingsHighlighter}}, which promises better runtime
performance at the cost of slightly more disk space. To minimize the impact on
index size, I've enabled it only for analyzed fields which are currently
indexed using {{FieldFactory.OAK_TYPE}} and already store
{{DOCS_AND_FREQS_AND_POSITIONS}}.
To use the {{PostingsHighlighter}} I had to set {{FieldFactory.OAK_TYPE}} to
use {{DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS}}. I then changed the excerpt
extraction logic to try {{PostingsHighlighter}} first on analyzed fields and
fall back to the plain {{Highlighter}} if no highlighting is found (although it
may make sense to unify the approach, as in the worst case this would be less
performant than the previous patch).
When only the {{PostingsHighlighter}} is used, the performance gain is around
40%.
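
To make the control flow concrete, here is a minimal sketch of the "try the
postings-based highlighter first, then fall back" logic. It is not the patch
itself: {{ExcerptFallbackSketch}}, {{ExcerptSource}} and {{extract}} are
hypothetical names used only for illustration, and the two highlighters are
modelled as plain functions rather than the actual Lucene classes.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Oak or Lucene.
final class ExcerptFallbackSketch {

    /** Stand-in for a highlighter: yields an excerpt, or empty when nothing was highlighted. */
    interface ExcerptSource extends Function<String, Optional<String>> {}

    /**
     * Tries the fast (postings-based) source first, but only for analyzed fields,
     * and falls back to the plain source when no highlighting is found.
     * Worst case: both sources run for the same field value.
     */
    static Optional<String> extract(String fieldValue, boolean analyzed,
                                    ExcerptSource fast, ExcerptSource plain) {
        if (analyzed) {
            Optional<String> excerpt = fast.apply(fieldValue);
            if (excerpt.isPresent()) {
                return excerpt;
            }
        }
        return plain.apply(fieldValue);
    }

    public static void main(String[] args) {
        ExcerptSource fastMiss = v -> Optional.empty();
        ExcerptSource fastHit = v -> Optional.of("fast:" + v);
        ExcerptSource plain = v -> Optional.of("plain:" + v);

        System.out.println(extract("oak", true, fastHit, plain).get());   // fast:oak
        System.out.println(extract("oak", true, fastMiss, plain).get());  // plain:oak
        System.out.println(extract("oak", false, fastHit, plain).get());  // plain:oak
    }
}
```

The second call shows the worst case mentioned above: the fast source runs,
finds nothing, and the plain source then runs as well.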
> Excerpt extraction from the Lucene index should be more selective
> -----------------------------------------------------------------
>
> Key: OAK-4368
> URL: https://issues.apache.org/jira/browse/OAK-4368
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Affects Versions: 1.0.30, 1.2.14, 1.4.2, 1.5.2
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
> Fix For: 1.5.3
>
> Attachments: OAK-4368.0.patch
>
>
> The Lucene index can be used to extract _rep:excerpt_ using
> {{Highlighter}}.
> The current implementation may suffer performance issues when the result set
> of the original query contains many results, each possibly containing many
> (stored) properties that get passed to the highlighter to try to extract the
> excerpt. This process doesn't stop as soon as the first excerpt is found, so
> the excerpt is composed using text from all stored properties in all results
> (if there's a match on the query).
> While we can accept some cost of extracting the excerpt at query time (whereas
> it was generated at excerpt retrieval time before OAK-3580, e.g. via
> _row.getValue("rep:excerpt")_), that cost should be bounded and mitigated as
> much as possible.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)