[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

Michael McCandless (JIRA) Sun, 11 Jul 2010 14:06:16 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887230#action_12887230
 ]


Michael McCandless commented on LUCENE-2130:
--------------------------------------------

I think the slowness comes from a particularly bad interaction b/w how 
FuzzyTermsEnum uses MultiTermsEnum.seek during the rewrite, and MTE's impl of 
seek.

EG if you seek MTE to term Naa.  Say there are 4 segments, that seek to these 
terms:  Nab M O P.

If you next seek to term Nac, then the 3 segments (M O P above) will do tons of 
work (a full re-seek, ie binary search through the index, re-scan) only to 
discover they are still at M O P.  There's an O(N^2) factor at play, where N = 
term index interval, and the constant factor is highish (decoding each term 
entry on scan).

I think we could conceivably fix MTE to better handle this "seeking a bit 
forward at a time" case, eg by keeping track of last seek term and not touching 
the M O P segments since we know they'll just seek back to the same point, I 
suspect if we do so the perf will still be sizably slower than if we can 
rewrite per-segment.

> Investigate Rewriting Constant Scoring MultiTermQueries per segment
> -------------------------------------------------------------------
>
>                 Key: LUCENE-2130
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2130
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Mark Miller
>             Fix For: 4.0
>
>         Attachments: LUCENE-2130.patch, LUCENE-2130.patch
>
>
> This issue is likely not to go anywhere, but I thought we might explore it. 
> The only idea I have come up with is fairly ugly, and unless something better 
> comes up, this is not likely to happen.
> But if we could rewrite constant score multi-term queries per segment, MTQ's 
> with auto (when the heuristic doesnt cut over to constant filter), or 
> constant boolean rewrite could enum terms against a single segment and then 
> apply a boolean query against each segment with just the terms that are known 
> to be in that segment. This also allows you to avoid DirectoryReaders 
> MultiTermEnum and its PQ. (See Roberts comment below).
> No biggie, not likely, but what the heck.
> So the ugly way to do it is to add a property to query's and weights - 
> lateCnstRewrite or something, that defaults to false. MTQ would return true 
> if its in a constant score mode. On the top level rewrite, if this is 
> detected, an empty ConstantScoreQuery is made, and its Weight is turned to 
> lateCnstRewrite and it keeps a ref to the original MTQ query. It also gets 
> its boost set to the MTQ's boost. Then when we are searching per segment, if 
> the Weight is lateCnstRewrite, we grab the orig query and actually do the 
> rewrite against the subreader and grab the actual constantscore weight. It 
> works I think - but its a little ugly.
> Not sure its worth the baggage for the win - but perhaps the objective can be 
> met in another way.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

Reply via email to