[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

Jake Mannix (JIRA) Sun, 25 Oct 2009 13:32:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769860#action_12769860
 ]


Jake Mannix commented on LUCENE-1997:
-------------------------------------

Mark, you say that with the previous numbers you'd vote "-1", but if you look 
at the most common use case (top 10), the simpler API is faster in almost all 
cases, and in some cases it's 10-20% faster.   Top 500 and top 1000 are not 
just "not as common"; they're probably at the 1% level, or less.

As far as shifting back goes, API-wise, that really shouldn't be a factor: 2.9 
*just* came out, and what, we stick with a slightly *slower* API (for the most 
common use case across all Lucene users), which happens to be *more complex* 
and, more importantly, just very nonstandard?  Comparable is very familiar to 
everyone, even if you have to have two forms, one for primitives and one for 
Objects.  An API which *doesn't* have the whole slew of compare(), 
compareBottom(), copy(), setBottom(), value() and setNextReader() has a 
tremendous advantage over one which does.  
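
To make the contrast concrete, here is a minimal sketch of the multi-PQ idea 
built on nothing but Comparable.  It is an illustration under assumptions, not 
the code in the attached patches: the class and method names (MultiPQSketch, 
topNForSegment, topN) are hypothetical, each "segment" is just a list of 
values, and it uses java.util.PriorityQueue rather than Lucene's own 
PriorityQueue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: one bounded queue per segment, merged at the end.
public class MultiPQSketch {

    /** Keeps the n smallest values of a single segment in a bounded heap. */
    static <T extends Comparable<T>> PriorityQueue<T> topNForSegment(
            List<T> segmentValues, int n) {
        // Reverse order so the head of the queue is the worst entry kept so far.
        Comparator<T> worstFirst = Collections.reverseOrder();
        PriorityQueue<T> pq = new PriorityQueue<>(Math.max(1, n), worstFirst);
        if (n <= 0) {
            return pq;
        }
        for (T value : segmentValues) {
            if (pq.size() < n) {
                pq.add(value);
            } else if (value.compareTo(pq.peek()) < 0) {
                // The new value beats the current worst entry: replace it.
                pq.poll();
                pq.add(value);
            }
        }
        return pq;
    }

    /** Builds one queue per segment, then merges them into a global top n. */
    static <T extends Comparable<T>> List<T> topN(List<List<T>> segments, int n) {
        List<T> merged = new ArrayList<>();
        for (List<T> segment : segments) {
            merged.addAll(topNForSegment(segment, n));
        }
        Collections.sort(merged);
        return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
    }
}
```

The only thing a custom sort has to supply in this sketch is compareTo(); the 
per-segment bookkeeping lives in the collector, which is the simplification 
being argued for.  Whether it is fast enough is exactly what the benchmarks on 
this issue are meant to measure.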

It's "advanced" to implement a custom sort, but it will be *easier* if it's not 
complex, and then it doesn't *need* to be "advanced" (shouldn't we be striving 
to have fewer APIs which are listed as "advanced", and instead more 
features which can *do* complex things but are still listed as things "normal 
users" can do?).

I think it's a *great* precedent to set with users to say, "oops!  we found 
that this new (just now as of this version) api was unnecessarily clumsy, we're 
shifting back to a simpler one which is just like the one you used to have".  
Sticking with a worse api because it performs better only in extreme scenarios, 
on the grounds that "we already moved on to this new api, shouldn't go back 
now, don't want to admit we ever made a mistake!", is what is "ugly".

The main thing to remember is that the entire thinking around making this 
different from the old one was *only* because it seemed that using a simpler 
api would perform much worse than this one, and it does not appear that this is 
the case.  If that original reasoning turns out to have been incorrect, then 
the answer is simple: go with the simpler API *now* before users *do* get used 
to using the new one.

If it turns out I'm wrong, and lots of users sort based on field values for the 
top 1000 entries often, or that the most recent runs turn out to be flukes and 
are not typical performance, only then would I change my opinion.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
>                 Key: LUCENE-1997
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1997
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, 
> LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
>     segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
>     random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces a table (in Jira format) as output.
