[ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930723#action_12930723 ]

Hoss Man commented on SOLR-2218:
--------------------------------

The performance gets slower as the start increases because, in order to give 
you rows N...M sorted by score, Solr must collect the top M documents (in 
sorted order). Lance's point is that if you use "sort=_docid_+asc", this 
collection of top-ranking documents in sorted order doesn't have to happen.
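
For example, something along these lines (the query itself is hypothetical; 
the relevant part is just the sort param) avoids that score-sorted collection 
entirely:

  q=*:*&fl=id&sort=_docid_+asc&start=1000000&rows=1000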

If you have to use sorting, keep in mind that the decrease in performance as 
the "start" param increases without bound is primarily driven by the number of 
documents that have to be collected/compared on the sort field -- something 
that wouldn't change if you had a named cursor (you would just be paying that 
cost up front instead of per request).

You should be able to get equivalent functionality by reducing the number of 
collected documents -- instead of increasing the start param, add a filter on 
the sort field indicating that you only want documents with a field value 
higher (or lower if using "desc" sort) than the last document encountered so 
far.  (If you are sorting on score this becomes trickier, but should be 
possible using the "frange" parser with the "query" function.)
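
A rough sketch of what I mean (field name and values are made up, and the 
mixed exclusive/inclusive range bracket may depend on your parser version -- 
an inclusive range plus an extra clause excluding the boundary document works 
too):

  # "page 1": no deep start, just take the first 1000 in sort order
  q=*:*&sort=price+asc&rows=1000&start=0

  # "page 2": keep start=0, but filter to values past the last one seen (42.50 here)
  q=*:*&sort=price+asc&rows=1000&start=0&fq=price:{42.50 TO *]

  # sorting on score: filter on the score of the last doc seen, via frange + query()
  q=foo&sort=score+desc&rows=1000&start=0&fq={!frange u=0.8731 incu=false}query($q)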

> Performance of start= and rows= parameters are exponentially slow with large 
> data sets
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2218
>                 URL: https://issues.apache.org/jira/browse/SOLR-2218
>             Project: Solr
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 1.4.1
>            Reporter: Bill Bell
>
> With large data sets (> 10M rows), setting start=<large number> and 
> rows=<large number> is slow, and gets slower the farther you get from 
> start=0 with a complex query. Random also makes this slower.
> Would like to somehow make this performance faster for looping through large 
> data sets. It would be nice if we could pass a pointer to the result set to 
> loop over, or support very large rows=<number>.
> Something like:
> rows=1000
> start=0
> spointer=string_my_query_1
> Then within an interval (like 5 mins) I can reference this loop:
> Something like:
> rows=1000
> start=1000
> spointer=string_my_query_1
> What do you think? Since the data set is so large, the cache is not helping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

