[jira] [Commented] (LUCENE-6445) Highlighter TokenSources simplification; just one getAnyTokenStream()

David Smiley (JIRA) Mon, 20 Apr 2015 11:58:32 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503426#comment-14503426
 ]


David Smiley commented on LUCENE-6445:
--------------------------------------

What I propose is two methods:
{code}
getTokenStream(String field, Fields tvFields, String text, Analyzer analyzer, 
int maxStartOffset) throws IOException
{code}
and
{code}
getTermVectorTokenStreamIfPresent(String field, Fields tvFields, 
int maxStartOffset) throws IOException
{code}
All the others can be deprecated in 5x and removed in trunk.  If you supply a 
maxStartOffset, it should apply to whichever source the tokens come from: term 
vectors or analyzed text.  See LUCENE-6423 (LimitTokenOffsetFilter).  If the 
term vector doesn't have offsets then it won't be used.  Ditto for positions.  
If the caller knows what it's doing and wants to use 
TokenStreamFromTermVector with offsets but not positions then it can do so, but 
not with these convenience methods.  The second method can return null; its 
name is suggestive of that.  IOException is never masked in a RuntimeException.
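
To make the intended behavior concrete, here is a rough sketch of the fallback 
logic using plain-Java stand-ins.  Everything other than the two method names 
is an assumption for illustration: Token, TermVector, SimpleAnalyzer, limit() 
and whitespace() are hypothetical and are not Lucene classes, a Map stands in 
for Fields, a List stands in for TokenStream, a negative maxStartOffset is 
assumed to mean "no limit", and the limit is assumed to be inclusive.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these types are Lucene classes.
final class TokenSourcesSketch {

    /** A term plus its start offset in the original text. */
    static final class Token {
        final String term;
        final int startOffset;
        Token(String term, int startOffset) { this.term = term; this.startOffset = startOffset; }
    }

    /** Stand-in for one field's term vector and what was stored with it. */
    static final class TermVector {
        final List<Token> tokens;
        final boolean hasOffsets;
        final boolean hasPositions;
        TermVector(List<Token> tokens, boolean hasOffsets, boolean hasPositions) {
            this.tokens = tokens;
            this.hasOffsets = hasOffsets;
            this.hasPositions = hasPositions;
        }
    }

    /** Stand-in for an Analyzer. */
    interface SimpleAnalyzer {
        List<Token> analyze(String field, String text) throws IOException;
    }

    /** Keeps tokens whose start offset is <= maxStartOffset; negative means no limit (assumption). */
    static List<Token> limit(List<Token> tokens, int maxStartOffset) {
        if (maxStartOffset < 0) return tokens;
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (t.startOffset > maxStartOffset) break; // assumes tokens arrive in offset order
            kept.add(t);
        }
        return kept;
    }

    /** Returns null unless the field has a term vector with both offsets and positions. */
    static List<Token> getTermVectorTokenStreamIfPresent(
            String field, Map<String, TermVector> tvFields, int maxStartOffset) throws IOException {
        if (tvFields == null) return null;
        TermVector tv = tvFields.get(field);
        if (tv == null || !tv.hasOffsets || !tv.hasPositions) return null;
        return limit(tv.tokens, maxStartOffset);
    }

    /** Never returns null: uses the term vector when usable, else analyzes the text. */
    static List<Token> getTokenStream(String field, Map<String, TermVector> tvFields, String text,
            SimpleAnalyzer analyzer, int maxStartOffset) throws IOException {
        List<Token> fromTermVector = getTermVectorTokenStreamIfPresent(field, tvFields, maxStartOffset);
        if (fromTermVector != null) return fromTermVector;
        return limit(analyzer.analyze(field, text), maxStartOffset);
    }

    /** Toy whitespace tokenizer that records start offsets. */
    static List<Token> whitespace(String field, String text) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) out.add(new Token(text.substring(start, i), start));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Map<String, TermVector> tvFields = new HashMap<>();
        // Offsets but no positions: the convenience methods skip this term vector.
        tvFields.put("title", new TermVector(
                Arrays.asList(new Token("quick", 0), new Token("fox", 6)), true, false));
        List<Token> tokens = getTokenStream(
                "title", tvFields, "quick fox jumps", TokenSourcesSketch::whitespace, 6);
        for (Token t : tokens) System.out.println(t.term + "@" + t.startOffset); // quick@0, fox@6
    }
}
```

The point of the sketch is that the limit is applied in one place regardless of 
where the tokens came from, and that only the "IfPresent" variant can hand back 
null.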

> Highlighter TokenSources simplification; just one getAnyTokenStream()
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-6445
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6445
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>
> The Highlighter "TokenSources" class has quite a few utility methods 
> pertaining to getting a TokenStream from either term vectors or analyzed 
> text.  I think it's too much:
> * some go to term vectors, some don't.  But if you don't want to go to term 
> vectors, then it's quite easy for the caller to invoke the Analyzer for the 
> field value, and to get that field value.
> * Some methods return null, some never null; I forget which at a glance.
> * Some methods read the Document (to get a field value) from the IndexReader, 
> some don't.  Furthermore, it's not an ideal place to get the doc since your 
> app might be using an IndexSearcher with a document cache (e.g. 
> SolrIndexSearcher).
> * None of the methods accept a Fields instance from term vectors as a 
> parameter.  Based on how Lucene's term vector format works, this is a 
> performance trap if you don't re-use an instance across fields on the 
> document that you're highlighting.
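
As an illustration of the last point in the description, the sketch below 
counts how often a document's term vectors get loaded when highlighting several 
fields.  CountingReader and both helper methods are hypothetical stand-ins, not 
Lucene code; the Map it returns plays the role of the Fields instance you would 
get from the IndexReader for that document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these types are Lucene classes.
final class TermVectorReuseSketch {

    /** Pretend reader that counts each (assumed expensive) term vector load. */
    static final class CountingReader {
        int loads = 0;

        Map<String, List<String>> getTermVectors(int docId) {
            loads++; // each call stands for decoding the document's term vectors again
            Map<String, List<String>> tv = new HashMap<>();
            tv.put("title", Arrays.asList("quick", "fox"));
            tv.put("body", Arrays.asList("lazy", "dog"));
            return tv;
        }
    }

    /** The trap: term vectors are fetched again for every highlighted field. */
    static int countTermsLoadingPerField(CountingReader reader, int docId, List<String> fields) {
        int total = 0;
        for (String field : fields) {
            List<String> terms = reader.getTermVectors(docId).get(field);
            if (terms != null) total += terms.size();
        }
        return total;
    }

    /** The proposed usage: fetch once per document, pass the same instance for each field. */
    static int countTermsLoadingOnce(CountingReader reader, int docId, List<String> fields) {
        Map<String, List<String>> tvFields = reader.getTermVectors(docId);
        int total = 0;
        for (String field : fields) {
            List<String> terms = tvFields.get(field);
            if (terms != null) total += terms.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("title", "body");
        CountingReader perField = new CountingReader();
        countTermsLoadingPerField(perField, 0, fields);
        CountingReader once = new CountingReader();
        countTermsLoadingOnce(once, 0, fields);
        System.out.println("loads per field: " + perField.loads + ", loads once: " + once.loads);
    }
}
```

Accepting the term vector instance as a parameter is what lets the caller 
choose the second pattern.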



