The problem with a large "start" is probably even worse when sharding is involved. Does anyone know how the shard component goes about fetching start=1000000&rows=10 from, say, 10 shards? Does it have to merge sorted lists of 1,000,010 doc ids from each shard, which would be the worst case?
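My mental model of the worst case is roughly the toy sketch below -- this is not the actual QueryComponent merge code, just an illustration of why each shard would have to return start+rows entries:

import heapq

# Toy model of distributed paging: each shard is asked for its own top
# (start + rows) hits, already sorted, and the coordinator n-way merges
# those sorted lists and throws away everything before `start`.
def merge_shard_pages(shard_pages, start, rows):
    # shard_pages: one list per shard of (sort_value, doc_id), each pre-sorted
    merged = heapq.merge(*shard_pages)       # lazy n-way merge of sorted inputs
    page = []
    for i, hit in enumerate(merged):
        if i < start:
            continue                         # fetched from a shard only to be discarded
        page.append(hit)
        if len(page) == rows:
            break
    return page

# start=1000000&rows=10 over 10 shards: each shard may have to ship up to
# 1,000,010 (sort_value, doc_id) pairs, i.e. ~10 million entries to merge.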
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. nov. 2010, at 20.22, Hoss Man (JIRA) wrote:

>
> [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930723#action_12930723 ]
>
> Hoss Man commented on SOLR-2218:
> --------------------------------
>
> The performance gets slower as the start increases because, in order to give
> you rows N...M sorted by score, Solr must collect the top M documents (in
> sorted order). Lance's point is that if you use "sort=_docid_+asc", this
> collection of top-ranking documents in sorted order doesn't have to happen.
>
> If you have to use sorting, keep in mind that the decrease in performance as
> the "start" param increases without bound is primarily driven by the number of
> documents that have to be collected/compared on the sort field -- something
> that wouldn't change if you had a named cursor (you would just be paying that
> cost up front instead of per request).
>
> You should be able to get equivalent functionality by reducing the number of
> collected documents -- instead of increasing the start param, add a filter on
> the sort field indicating that you only want documents with a field value
> higher (or lower, if using "desc" sort) than the last document encountered so
> far. (If you are sorting on score this becomes trickier, but it should be
> possible using the "frange" parser with the "query" function.)
>
>> Performance of start= and rows= parameters are exponentially slow with large data sets
>> ---------------------------------------------------------------------------------------
>>
>>                 Key: SOLR-2218
>>                 URL: https://issues.apache.org/jira/browse/SOLR-2218
>>             Project: Solr
>>          Issue Type: Improvement
>>          Components: Build
>>    Affects Versions: 1.4.1
>>            Reporter: Bill Bell
>>
>> With large data sets (> 10M rows), setting start=<large number> and
>> rows=<large number> is slow, and it gets slower the farther you get from
>> start=0 with a complex query. Random sorting also makes this slower.
>> I would like to somehow make this faster for looping through large data sets.
>> It would be nice if we could pass a pointer to the result set to loop over,
>> or support very large rows=<number>. Something like:
>> rows=1000
>> start=0
>> spointer=string_my_query_1
>> Then, within an interval (like 5 mins), I can reference this loop again with
>> something like:
>> rows=1000
>> start=1000
>> spointer=string_my_query_1
>> What do you think? Since the data set is too large, the cache is not helping.
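For anyone who wants to try the filter-on-the-sort-field approach Hoss describes, here is a rough, untested Python sketch. The Solr URL, the field names ("price" as the sort field, "id" as the unique key) and the page size are just placeholders, and it sticks to an inclusive range filter plus client-side de-duplication rather than relying on any version-specific exclusive-bound syntax:

import json
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/select"   # placeholder URL

def walk_index(q="*:*", sort_field="price", key_field="id", rows=1000):
    """Walk a large result set with start=0 on every request, narrowing each
    request with a filter query on the sort field instead of growing start."""
    last_value = None
    seen = set()   # keys already yielded (sketch only; in practice you would
                   # only need to track keys at the current boundary value)
    while True:
        params = {
            "q": q,
            "rows": rows,
            "start": 0,                          # never increases
            "sort": "%s asc" % sort_field,
            "wt": "json",
        }
        if last_value is not None:
            # Inclusive lower bound on the sort field; docs that tie with
            # last_value come back again and are skipped via `seen`.
            params["fq"] = "%s:[%s TO *]" % (sort_field, last_value)
        url = SOLR_SELECT + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            docs = json.load(resp)["response"]["docs"]
        new_docs = [d for d in docs if d[key_field] not in seen]
        if not new_docs:
            break   # also stops early if a single sort value spans > rows docs
        for d in new_docs:
            seen.add(d[key_field])
            yield d
        last_value = new_docs[-1][sort_field]

The point being that each request only has to collect roughly rows documents past the filter, so the per-request cost stays flat no matter how deep into the result set you are.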
