Hi,

> > The big question is: Do you need the results paged at all?
>
> Yup, because if we return all results, we get OME.

You get the OutOfMemoryError (OME) because the paging collector cannot handle a result window that large, so this is an XY problem. Would it not be better if your application just got the results as a stream and processed them one after another? If so (and most statistics need exactly that), you are much better off NOT using TopDocs! Your requirement is diametrically opposed to getting top-scoring documents: you want ALL results as a sequence.

> > Do you need them sorted?
>
> Nope.

OK, so unsorted streaming is the right approach.

> > If not, the easiest approach is to use a custom Collector that does no
> > sorting and just consumes the results.
>
> Main bottleneck as I see come from next page search, that took ~2-4
> seconds.

This is because, when paging, the collector has to re-execute the whole query and sort all results again, just with a larger window. So if your result pages hold 50000 results and you want the second page, it will internally sort 100000 results, because the first page has to be calculated too. As you move forward through the results, the window gets larger and larger, until it finally collects all results. So getting the results as a stream by implementing the Collector API is the right way to do this.

> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Valentin Popov [mailto:valentin...@gmail.com]
> >> Sent: Thursday, November 12, 2015 6:48 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: 500 millions document for loop.
> >>
> >> Toke, thanks!
> >>
> >> We will look at this solution, looks like this is that what we need.
> >>
> >>> On Nov 12, 2015, at 20:42, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> >>>
> >>> Valentin Popov <valentin...@gmail.com> wrote:
> >>>
> >>>> We have ~10 indexes for 500M documents, each document
> >>>> has «archive date», and «to» address, one of our task is
> >>>> calculate statistics of «to» for last year. Right now we are
> >>>> using search archive_date:(current_date - 1 year) and paginate
> >>>> results for 50k records for page. Bottleneck of that approach,
> >>>> pagination take too long time and on powerful server it take
> >>>> ~20 days to execute, and it is very long.
> >>>
> >>> Lucene does not like deep page requests due to the way the internal
> >>> Priority Queue works. Solr has CursorMark, which should be fairly
> >>> simple to emulate in your Lucene handling code:
> >>>
> >>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>
> >>> - Toke Eskildsen
> >>
> >> Regards,
> >> Valentin Popov
>
> Regards,
> Valentin Popov

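
To put rough numbers on the paging cost discussed in this thread, here is a small toy model in plain Java. It is not Lucene code: `PagingCostSketch`, `pagedWork`, and `streamedWork` are hypothetical names made up for illustration, and the model assumes the only cost is the number of entries a collector has to hold and rank per page request. The streaming half only mimics the idea of a collector that consumes each hit once without sorting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntConsumer;

/** Toy model (not Lucene code) comparing deep paging with one streaming pass. */
final class PagingCostSketch {

    /**
     * Total entries a top-N style collector ranks when a client walks through
     * all pages: the request for page k needs a window covering pages 1..k.
     */
    static long pagedWork(long totalHits, long pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        long work = 0;
        for (long window = pageSize; ; window += pageSize) {
            work += Math.min(window, totalHits); // earlier pages are recomputed
            if (window >= totalHits) {
                break;
            }
        }
        return work;
    }

    /** Streaming pass: each matching doc id reaches the consumer once, unsorted. */
    static long streamedWork(int[] matchingDocIds, IntConsumer consumer) {
        long visited = 0;
        for (int docId : matchingDocIds) {
            consumer.accept(docId);
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical data: the "to" address of each matching document.
        String[] toByDoc = {"a@example.org", "b@example.org", "a@example.org"};
        int[] hits = {0, 1, 2};

        // Statistics are accumulated while streaming; no result list is kept.
        Map<String, Integer> counts = new HashMap<>();
        long visited = streamedWork(hits, doc -> counts.merge(toByDoc[doc], 1, Integer::sum));

        System.out.println(counts + " after visiting " + visited + " hits");
        System.out.println("paged work, 2 pages of 50000: " + pagedWork(100_000, 50_000));
    }
}
```

In this model, fetching two pages of 50000 results ranks 50000 + 100000 = 150000 entries, and the total keeps growing quadratically with the number of pages, while the streaming pass touches each hit once. In real Lucene code the streaming side would be a custom Collector; which methods need overriding depends on the Lucene version in use.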
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org