Hi,

> On 12 Nov 2015, at 21:15, Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
>>> The big question is: Do you need the results paged at all?
>>
>> Yup, because if we return all results, we get an OOME.
>
> You get the OOME because the paging collector cannot handle that, so this is
> an XY problem. Would it not be better if your application just got the
> results as a stream and processed them one after another? If this is the
> case (and most statistics need it like that), you're much better off NOT USING
> TOPDOCS!!!! Your requirement is diametrically opposed to getting top-scoring
> documents! You want to get ALL results as a sequence.

Uwe, thanks. If it is possible, could you provide a code example?

>
>>> Do you need them sorted?
>>
>> Nope.
>
> OK, so unsorted streaming is the right approach.
>
>>> If not, the easiest approach is to use a custom Collector that does no
>>> sorting and just consumes the results.
>>
>> The main bottleneck, as I see it, comes from the next-page search, which
>> took ~2-4 seconds.
>
> This is because when paging the collector has to re-execute the whole query
> and sort all results again, just with a larger window. So if you have result
> pages of 50000 results and you want to get the second page, it will
> internally sort 100000 results, because the first page needs to be
> calculated, too. If you go forward in results the window gets larger and
> larger, until it finally collects all results.

Does this mean we are not using cursor-based iteration?

>
> So just getting the results as a stream by implementing the Collector API is
> the right way to do this.

Thanks!

>
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin...@gmail.com]
>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>>
>>>> Toke, thanks!
>>>>
>>>> We will look at this solution; it looks like this is what we need.
>>>>
>>>>
>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <t...@statsbiblioteket.dk>
>>>>> wrote:
>>>>>
>>>>> Valentin Popov <valentin...@gmail.com> wrote:
>>>>>
>>>>>> We have ~10 indexes for 500M documents. Each document
>>>>>> has an «archive date» and a «to» address, and one of our tasks is
>>>>>> to calculate statistics of «to» for the last year. Right now we are
>>>>>> using the search archive_date:(current_date - 1 year) and paginating
>>>>>> the results at 50k records per page.
>>>>>> The bottleneck of that approach is that
>>>>>> pagination takes too long: on a powerful server it takes
>>>>>> ~20 days to execute, which is far too long.
>>>>>
>>>>> Lucene does not like deep page requests due to the way the internal
>>>>> Priority Queue works. Solr has CursorMark, which should be fairly simple
>>>>> to emulate in your Lucene handling code:
>>>>>
>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>
>>>>> - Toke Eskildsen
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>
>>>> Regards,
>>>> Valentin Popov
>>>>
>>
>> Best regards,
>> Valentin Popov
>>

Regards,
Valentin Popov
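
For readers following the thread, here is a minimal sketch of the arithmetic behind Uwe's explanation, next to the single-pass alternative he recommends. It is plain Java with no Lucene dependency, so it only models the behaviour and is not Lucene's API. The class and method names (`PagingSketch`, `entriesCollectedByPaging`, `countRecipients`) and the hit count of 1,000,000 are assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: plain Java, no Lucene classes.
// All names in this class are hypothetical.
final class PagingSketch {

    // Total entries the collector has to handle when every page re-runs the
    // query with a window that grows by pageSize, as described in the thread:
    // page k collects min(k * pageSize, totalHits) entries.
    static long entriesCollectedByPaging(long totalHits, long pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        long total = 0;
        for (long window = pageSize; ; window += pageSize) {
            total += Math.min(window, totalHits);
            if (window >= totalHits) {
                break; // the last page has been reached
            }
        }
        return total;
    }

    // Streaming alternative: visit each hit exactly once, in any order, and
    // keep only the running count per "to" address, never the hits themselves.
    static Map<String, Long> countRecipients(Iterable<String> toAddresses) {
        Map<String, Long> counts = new HashMap<>();
        for (String to : toAddresses) {
            counts.merge(to, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed example size: 1,000,000 matching documents, 50,000 per page.
        System.out.println(entriesCollectedByPaging(1_000_000L, 50_000L));
        System.out.println(countRecipients(
                Arrays.asList("a@example.org", "b@example.org", "a@example.org")));
    }
}
```

With the assumed 1,000,000 hits at 50,000 per page, the paged variant handles 10,500,000 entries in total, against 1,000,000 for one streaming pass. In real Lucene code the per-hit counting would live inside a Collector implementation, whose exact method signatures depend on the Lucene version in use.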