[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex

Robert Muir (JIRA) Sat, 26 Feb 2011 20:45:31 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999880#comment-12999880
 ]


Robert Muir commented on LUCENE-2939:
-------------------------------------

I don't know why you get this NullPointerException (maybe you triggered a 
bug), but...

just a quick glance:
# why use offsets for this calculation? This seems a bit dangerous versus other 
approaches.
# either way, the reset() method should clear any state such as counters in the 
tokenstream.
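
To illustrate the second point, here is a minimal sketch in plain Java. It is not Lucene's TokenStream API and not the code from the attached patch: LimitedTokens, next() and the emitted counter are hypothetical names used only for illustration. It shows why a limiting stream that keeps a counter has to clear it in reset(); otherwise a reused instance would emit nothing on its second pass.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a limiting token stream; NOT Lucene's TokenStream API. */
public class LimitedTokens {
    private final List<String> source;
    private final int maxTokens;
    private Iterator<String> it;
    private int emitted; // per-pass state that reset() must clear

    public LimitedTokens(List<String> source, int maxTokens) {
        this.source = source;
        this.maxTokens = maxTokens;
        reset();
    }

    /** Returns the next token, or null once the limit or the end of input is reached. */
    public String next() {
        if (emitted >= maxTokens || !it.hasNext()) {
            return null;
        }
        emitted++;
        return it.next();
    }

    /** Rewinds the input and clears the counter so a second pass behaves like the first. */
    public void reset() {
        it = source.iterator();
        emitted = 0;
    }

    public static void main(String[] args) {
        LimitedTokens tokens = new LimitedTokens(Arrays.asList("a", "b", "c"), 2);
        System.out.println(tokens.next()); // a
        System.out.println(tokens.next()); // b
        System.out.println(tokens.next()); // null (limit of 2 reached)
        tokens.reset();
        System.out.println(tokens.next()); // a (counter was cleared)
    }
}
```

If reset() only rewound the iterator and left emitted untouched, the last call would return null instead of "a".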

As for what I meant above: maxDocCharsToAnalyze as a whole seems like the 
wrong measure.
Why not specify this simply as a maximum number of tokens and use 
LimitTokenCountAnalyzer, which is already implemented?

Using arbitrary chars and offsets is going to create fake tokens (e.g. truncated 
words) and other problems.
Besides, it's not Unicode-safe, since a code point might span multiple chars.
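
Both problems can be reproduced with the standard library alone. The sketch below is plain Java, not Lucene code: TruncationSketch and its methods are hypothetical names, and whitespace splitting stands in for a real analyzer. It contrasts cutting at a char offset with keeping a fixed number of whole tokens, and shows a cut landing in the middle of a surrogate pair.

```java
import java.util.ArrayList;
import java.util.List;

public class TruncationSketch {

    /** Naive cut at a fixed number of UTF-16 chars, ignoring word and code point boundaries. */
    static String truncateChars(String text, int maxChars) {
        return text.substring(0, Math.min(maxChars, text.length()));
    }

    /** Keeps at most maxTokens whitespace-separated tokens; every kept token is whole. */
    static List<String> limitTokens(String text, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (kept.size() == maxTokens) {
                break;
            }
            if (!token.isEmpty()) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** True if the string ends with a high surrogate that has lost its low surrogate. */
    static boolean endsWithBrokenSurrogate(String s) {
        return !s.isEmpty() && Character.isHighSurrogate(s.charAt(s.length() - 1));
    }

    public static void main(String[] args) {
        String doc = "highlighter performance";
        System.out.println(truncateChars(doc, 15)); // "highlighter per" (fake token "per")
        System.out.println(limitTokens(doc, 1));    // [highlighter]

        // U+1D11E (musical symbol G clef) is one code point but two chars in UTF-16.
        String clef = "a\uD834\uDD1E";
        System.out.println(clef.length());                                   // 3
        System.out.println(clef.codePointCount(0, clef.length()));           // 2
        System.out.println(endsWithBrokenSurrogate(truncateChars(clef, 2))); // true
    }
}
```

A token-count limit never has to decide where inside a word or a code point to stop, which is the appeal of expressing the limit in tokens rather than chars.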


> Highlighter should try and use maxDocCharsToAnalyze in 
> WeightedSpanTermExtractor when adding a new field to MemoryIndex
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2939
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2939
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-2939.patch
>
>
> huge documents can be drastically slower than need be because the entire 
> field is added to the memory index
> this cost can be greatly reduced in many cases if we try and respect 
> maxDocCharsToAnalyze
> the cost is still not fantastic, but is at least improved in many situations 
> and can be influenced with this change
