[ https://issues.apache.org/jira/browse/LUCENE-6392 ]
David Smiley updated LUCENE-6392:
---------------------------------
    Attachment: LUCENE-6392_highlight_term_vector_maxStartOffset.patch

(Patch attached.) Elaborating on the description:

This patch tweaks the TokenLL[] array size initialization to take the new limit into account when guessing a good size.

The patch also includes memory-saving optimizations to the information the stream accumulates. Before the patch, each TokenLL held its own char[], so there were 2 objects per token in total (including the token itself). Now I use a shared CharsRefBuilder with a pointer & length into it, so there's just 1 object per token, plus the byte savings of avoiding a char[] object header. I also reduced the bytes needed for a TokenLL instance from 40 to 32. *It does assume that the char offset delta (endOffset - startOffset) fits within a short*, which seems like a reasonable assumption to me. For safety I guard against overflow and substitute Short.MAX_VALUE.

Finally, to encourage users to supply a limit (even if "-1" to mean no limit), I decided to deprecate many of the methods in TokenSources in favor of new ones that take a limit parameter. But for those methods that fall back to a provided Analyzer, _I now wonder whether it makes sense for them to limit the analyzer-produced tokens too_. I think it does -- if you want to limit the tokens, it shouldn't matter where they came from; you want them limited. I haven't added that yet, but I'm looking for feedback first.

(Illustrative sketches of the token layout, the new method shape, and the analyzer-side filter are appended after the quoted issue below.)

> Add offset limit to Highlighter's TokenStreamFromTermVector
> -----------------------------------------------------------
>
>                 Key: LUCENE-6392
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6392
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.2
>
>         Attachments: LUCENE-6392_highlight_term_vector_maxStartOffset.patch
>
>
> The Highlighter's TokenStreamFromTermVector utility, typically accessed via
> TokenSources, should have the ability to filter out tokens beyond a
> configured offset. There is a TODO there already, and this issue addresses
> it. New methods in TokenSources now propagate a limit.
> This patch also includes some memory-saving optimizations, to be described
> shortly.
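
Sketch 1: the compacted per-token layout. This is a minimal illustration of the scheme described above, not the patch itself; the field and method names here are hypothetical:

{code:java}
/**
 * Minimal sketch of the compacted per-token state described above.
 * Field names are illustrative, not copied from the patch.
 */
class TokenLL {
  // Shared-buffer reference instead of a per-token char[]:
  int termCharsOff;    // start of this token's chars in a shared CharsRefBuilder
  short termCharsLen;  // number of chars

  int startOffset;     // absolute start offset in the source text
  short endOffsetInc;  // endOffset - startOffset, clamped to a short

  int positionIncrement;
  TokenLL next;        // tokens are chained in a linked list

  void setOffsets(int startOffset, int endOffset) {
    this.startOffset = startOffset;
    // Guard against overflow: substitute Short.MAX_VALUE if the delta won't fit.
    this.endOffsetInc = (short) Math.min(endOffset - startOffset, Short.MAX_VALUE);
  }

  int endOffset() {
    return startOffset + endOffsetInc;
  }
}
{code}

On a typical 64-bit JVM with compressed oops, that's a 12-byte header + three ints (12) + two shorts (4) + one reference (4) = 32 bytes, consistent with the 40-to-32 reduction above.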
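
Sketch 2: the deprecation pattern for TokenSources. The actual signatures in the patch differ; this only shows the shape -- the old method delegates to the new one, passing -1 for "no limit":

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Terms;
import org.apache.lucene.search.highlight.TokenStreamFromTermVector;

public final class TokenSourcesSketch {

  /** @deprecated Use {@link #getTokenStream(Terms, int)}; pass -1 for no limit. */
  @Deprecated
  public static TokenStream getTokenStream(Terms termVector) throws IOException {
    return getTokenStream(termVector, -1);
  }

  /**
   * Returns a token stream rebuilt from the term vector, dropping tokens
   * whose start offset exceeds maxStartOffset (-1 means no limit).
   */
  public static TokenStream getTokenStream(Terms termVector, int maxStartOffset)
      throws IOException {
    return new TokenStreamFromTermVector(termVector, maxStartOffset);
  }
}
{code}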
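
Sketch 3: the analyzer-side filtering I'm asking for feedback on. If the Analyzer fall-back paths should also honor the limit, the wrapper is small -- a hand-rolled TokenFilter along these lines (not in the patch):

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/** Ends the wrapped stream once a token starts past maxStartOffset. Sketch only. */
final class LimitTokenOffsetFilterSketch extends TokenFilter {

  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final int maxStartOffset;

  // Callers would only wrap when maxStartOffset >= 0 (-1 means no limit).
  LimitTokenOffsetFilterSketch(TokenStream in, int maxStartOffset) {
    super(in);
    this.maxStartOffset = maxStartOffset;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Assuming the usual start-offset order of analyzer output, the first
    // token past the limit ends the stream.
    return offsetAtt.startOffset() <= maxStartOffset;
  }
}
{code}

Usage would be wrapping the fall-back stream, e.g. new LimitTokenOffsetFilterSketch(analyzer.tokenStream(field, text), maxStartOffset).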