The problem with a large "start" is probably even worse when sharding is involved. Does anyone know how the shard component goes about fetching start=1000000&rows=10 from, say, 10 shards? Does it have to merge sorted lists of 1,000,010 doc ids from each shard, which would be the worst case?
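My mental model of the worst case is roughly the toy sketch below -- this is not the actual QueryComponent merge code, just an illustration of why each shard would have to return start+rows entries:

import heapq

# Toy model of distributed paging: each shard is asked for its own top
# (start + rows) hits, already sorted, and the coordinator n-way merges
# those sorted lists and throws away everything before `start`.
def merge_shard_pages(shard_pages, start, rows):
    # shard_pages: one list per shard of (sort_value, doc_id), each pre-sorted
    merged = heapq.merge(*shard_pages)       # lazy n-way merge of sorted inputs
    page = []
    for i, hit in enumerate(merged):
        if i < start:
            continue                         # fetched from a shard only to be discarded
        page.append(hit)
        if len(page) == rows:
            break
    return page

# start=1000000&rows=10 over 10 shards: each shard may have to ship up to
# 1,000,010 (sort_value, doc_id) pairs, i.e. ~10 million entries to merge.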
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. nov. 2010, at 20.22, Hoss Man (JIRA) wrote:

>
> [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930723#action_12930723 ]
>
> Hoss Man commented on SOLR-2218:
> --------------------------------
>
> The performance gets slower as the start increases because, in order to give
> you rows N...M sorted by score, Solr must collect the top M documents (in
> sorted order). Lance's point is that if you use "sort=_docid_+asc", this
> collection of top-ranking documents in sorted order doesn't have to happen.
>
> If you have to use sorting, keep in mind that the decrease in performance as
> the "start" param increases without bound is primarily driven by the number of
> documents that have to be collected/compared on the sort field -- something
> that wouldn't change if you had a named cursor (you would just be paying that
> cost up front instead of per request).
>
> You should be able to get equivalent functionality by reducing the number of
> collected documents -- instead of increasing the start param, add a filter on
> the sort field indicating that you only want documents with a field value
> higher (or lower, if using "desc" sort) than the last document encountered so
> far. (If you are sorting on score this becomes trickier, but it should be
> possible using the "frange" parser with the "query" function.)
>
>> Performance of start= and rows= parameters are exponentially slow with large data sets
>> ---------------------------------------------------------------------------------------
>>
>>                 Key: SOLR-2218
>>                 URL: https://issues.apache.org/jira/browse/SOLR-2218
>>             Project: Solr
>>          Issue Type: Improvement
>>          Components: Build
>>    Affects Versions: 1.4.1
>>            Reporter: Bill Bell
>>
>> With large data sets (> 10M rows), setting start=<large number> and
>> rows=<large number> is slow, and it gets slower the farther you get from
>> start=0 with a complex query. Random sorting also makes this slower.
>> I would like to somehow make this faster for looping through large data sets.
>> It would be nice if we could pass a pointer to the result set to loop over,
>> or support very large rows=<number>. Something like:
>> rows=1000
>> start=0
>> spointer=string_my_query_1
>> Then, within an interval (like 5 mins), I can reference this loop again with
>> something like:
>> rows=1000
>> start=1000
>> spointer=string_my_query_1
>> What do you think? Since the data set is too large, the cache is not helping.
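For anyone who wants to try the filter-on-the-sort-field approach Hoss describes, here is a rough, untested Python sketch. The Solr URL, the field names ("price" as the sort field, "id" as the unique key) and the page size are just placeholders, and it sticks to an inclusive range filter plus client-side de-duplication rather than relying on any version-specific exclusive-bound syntax:

import json
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/select"   # placeholder URL

def walk_index(q="*:*", sort_field="price", key_field="id", rows=1000):
    """Walk a large result set with start=0 on every request, narrowing each
    request with a filter query on the sort field instead of growing start."""
    last_value = None
    seen = set()   # keys already yielded (sketch only; in practice you would
                   # only need to track keys at the current boundary value)
    while True:
        params = {
            "q": q,
            "rows": rows,
            "start": 0,                          # never increases
            "sort": "%s asc" % sort_field,
            "wt": "json",
        }
        if last_value is not None:
            # Inclusive lower bound on the sort field; docs that tie with
            # last_value come back again and are skipped via `seen`.
            params["fq"] = "%s:[%s TO *]" % (sort_field, last_value)
        url = SOLR_SELECT + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            docs = json.load(resp)["response"]["docs"]
        new_docs = [d for d in docs if d[key_field] not in seen]
        if not new_docs:
            break   # also stops early if a single sort value spans > rows docs
        for d in new_docs:
            seen.add(d[key_field])
            yield d
        last_value = new_docs[-1][sort_field]

The point being that each request only has to collect roughly rows documents past the filter, so the per-request cost stays flat no matter how deep into the result set you are.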
