[jira] [Commented] (LUCENE-5415) Support wildcard & co in PostingsHighlighter

Robert Muir (JIRA) Fri, 24 Jan 2014 16:03:10 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881548#comment-13881548
 ]


Robert Muir commented on LUCENE-5415:
-------------------------------------

{quote}
How will the FakeDocsEnum.freq() lie affect the default PassageScorer? Will 
this bias against passages that had an MTQ match?
{quote}

Terribly. Yes. It's a prototype :) But remember: these queries are typically 
constant-score in Lucene.

{quote}
So, all MTQs are squished into a single fake/virtual term for matching, like I 
cannot tell which of the N MTQs in my query caused the hit. I think this is OK 
for starters: it's unusual (maybe?) to run multiple MTQs and to also care about 
which one matched each hit in the highlight. But I guess we could instead add N 
virtual terms, one per MTQ... later.
{quote}

Same as above: it's a prototype. I avoided an automaton UNION operation (scared 
of perf, and well, multiple MTQs are rarish). But who uses the API to look at 
the terms? Does telling them which MTQ matched really seem that important? 
(Nothing in the highlighter API uses this today!) They still have the offsets to 
know the actual text that matched, so I think it's OK for now.

In both cases: I tried to avoid special-casing this stuff (sorry, I think if 
you have a serious search system, then wildcards are rare), and instead add 
them in a non-disruptive way so that it's clear it doesn't break the algorithm, 
which I think is generally working "ok".


> Support wildcard & co in PostingsHighlighter
> --------------------------------------------
>
>                 Key: LUCENE-5415
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5415
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Robert Muir
>         Attachments: LUCENE-5415.patch
>
>
> PostingsHighlighter uses the offsets encoded in the postings lists for the 
> terms to find query matches.
> As such, it isn't really suitable for stuff like wildcards for two reasons:
> 1. an expensive rewrite against the term dictionary (I think other 
> highlighters share this problem)
> 2. accumulating data from potentially many terms (e.g. reading many postings)
> However, we could provide an option for some of these queries to work, but in 
> a different way, that avoids these downsides.
> Instead we can just grab the Automaton representation of the queries, and 
> match it against the content directly (which won't blow up).
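
The idea in the description (match the query's automaton against the content 
itself, rather than rewriting against the term dictionary and reading postings) 
can be sketched roughly as below. This is not the code in the attached patch: 
java.util.regex stands in for Lucene's Automaton, a plain letter/digit scan 
stands in for an Analyzer, and the class and method names are hypothetical. The 
wildcard is assumed to be lowercase already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a regex stands in for the query's Automaton, and a simple
// letter/digit scan stands in for real analysis. All names are hypothetical.
final class WildcardOffsetSketch {

    // Compiles a wildcard ('*' = any run of chars, '?' = exactly one char)
    // into a regex; every other character is matched literally.
    static Pattern compileWildcard(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Walks the tokens of the content and tests each one against the compiled
    // pattern. Returns {startOffset, endOffset} pairs (end is exclusive).
    // No term dictionary is consulted and no postings are read.
    static List<int[]> findMatches(String content, String wildcard) {
        Pattern compiled = compileWildcard(wildcard);
        Matcher tokens = Pattern.compile("[\\p{L}\\p{N}]+").matcher(content);
        List<int[]> matches = new ArrayList<>();
        while (tokens.find()) {
            String token = tokens.group().toLowerCase(Locale.ROOT);
            if (compiled.matcher(token).matches()) {
                matches.add(new int[] {tokens.start(), tokens.end()});
            }
        }
        return matches;
    }
}
```

The cost here is bounded by the length of the content being highlighted, not by 
how many index terms the wildcard would expand to. The returned offsets are 
also enough to recover the exact text that matched, even though the caller is 
never told which term (or which MTQ) produced the hit.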



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
