[ 
https://issues.apache.org/jira/browse/LUCENE-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-6392:
---------------------------------
    Attachment: LUCENE-6392_highlight_term_vector_maxStartOffset.patch

(Patch attached).
Elaborating on the description:

This patch includes a tweak to the TokenLL[] array size initialization so that 
it takes the new limit into account when estimating a good initial size.

This patch also includes memory-saving optimizations to the information the 
stream accumulates.  Before the patch, each TokenLL had its own char[], so 
there were 2 objects per token (counting the token itself).  Now I use a 
shared CharsRefBuilder with a per-token pointer & length into it, so there is 
just 1 object per token, plus byte savings from avoiding a char[] object 
header.  I also reduced the bytes needed for a TokenLL instance from 40 to 32.  
*It does assume that the char offset delta (endOffset - startOffset) fits in a 
short*, which seems like a reasonable assumption to me.  For safety I guard 
against overflow and substitute Short.MAX_VALUE.
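To illustrate the packing described above, here is a minimal, self-contained sketch (not the patch's actual code; the class and field names are hypothetical): the end offset is stored as a short delta from the start offset, clamped to Short.MAX_VALUE on overflow, and term chars are referenced by (offset, length) into one shared buffer instead of a per-token char[].

```java
// Hedged sketch of the memory layout idea, not the actual TokenLL class.
public class TokenPackingSketch {

  // Hypothetical compact token: the end offset is derivable as
  // startOffset + offsetDelta, and the term text lives in a shared
  // char buffer addressed by (termStart, termLength).
  static final class CompactToken {
    int startOffset;
    short offsetDelta;  // endOffset - startOffset, clamped to Short.MAX_VALUE
    int termStart;      // index into the shared char buffer
    int termLength;
  }

  // Guard against deltas that don't fit in a short by substituting
  // Short.MAX_VALUE, as the patch description says it does for safety.
  static short clampDelta(int startOffset, int endOffset) {
    int delta = endOffset - startOffset;
    return delta > Short.MAX_VALUE ? Short.MAX_VALUE : (short) delta;
  }

  public static void main(String[] args) {
    System.out.println(clampDelta(10, 25));      // prints 15
    System.out.println(clampDelta(0, 100_000));  // prints 32767 (clamped)
  }
}
```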

Finally, to encourage users to supply a limit (even if "-1" to mean no limit), 
I decided to deprecate many of the methods in TokenSources in favor of new 
ones that take a limit parameter.  But for those methods that fall back to a 
provided Analyzer, _I have to wonder now whether it makes sense for these 
methods to filter the analyzer-produced tokens as well_.  I think it does -- 
if you want to limit the tokens, it shouldn't matter where they came from -- 
you want to limit them.  I haven't added that yet; I'm looking for feedback 
first.
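For clarity on the intended semantics of the limit parameter, here is a tiny self-contained sketch (hypothetical names, not the patch's API): tokens whose start offset exceeds the limit are dropped, and a negative limit such as -1 means no limit, regardless of whether the tokens came from term vectors or an Analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged illustration of the proposed maxStartOffset semantics.
public class OffsetLimitSketch {

  // Minimal stand-in for a token: just a term and its start offset.
  record Tok(String term, int startOffset) {}

  // Keep tokens whose start offset is within the limit; a negative
  // limit (e.g. -1) disables filtering entirely.
  static List<Tok> applyLimit(List<Tok> tokens, int maxStartOffset) {
    if (maxStartOffset < 0) {
      return tokens;  // -1 means "no limit"
    }
    List<Tok> kept = new ArrayList<>();
    for (Tok t : tokens) {
      if (t.startOffset <= maxStartOffset) {
        kept.add(t);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Tok> toks = List.of(new Tok("quick", 4), new Tok("fox", 16));
    System.out.println(applyLimit(toks, 10).size());  // prints 1
    System.out.println(applyLimit(toks, -1).size());  // prints 2
  }
}
```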

> Add offset limit to Highlighter's TokenStreamFromTermVector
> -----------------------------------------------------------
>
>                 Key: LUCENE-6392
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6392
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.2
>
>         Attachments: LUCENE-6392_highlight_term_vector_maxStartOffset.patch
>
>
> The Highlighter's TokenStreamFromTermVector utility, typically accessed via 
> TokenSources, should have the ability to filter out tokens beyond a 
> configured offset. There is a TODO there already, and this issue addresses 
> it.  New methods in TokenSources now propagate a limit.
> This patch also includes some memory saving optimizations, to be described 
> shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
