RE: 500 millions document for loop.

Uwe Schindler Thu, 12 Nov 2015 10:16:22 -0800

Hi,

> > The big question is: Do you need the results paged at all?
> 
> Yup, because if we return all results, we get an OOM (OutOfMemoryError).


You get the OutOfMemoryError because the paging collector cannot handle that
many results, so this is an XY problem. Would it not be better if your
application just got the results as a stream and processed them one after
another? If this is the case (and most statistics need it like that), you're
much better off NOT USING TOPDOCS! Your requirement is diametrically opposed
to getting top-scoring documents: you want to get ALL results as a sequence.

> > Do you need them sorted?
> 
> Nope.

OK, so unsorted streaming is the right approach.

> > If not, the easiest approach is to use a custom Collector that does no
> > sorting and just consumes the results.
> 
> The main bottleneck, as I see it, comes from the next-page search, which
> took ~2-4 seconds.

This is because, when paging, the collector has to re-execute the whole query
and sort all results again, just with a larger window. So if you have result
pages of 50000 results and you want to get the second page, it will internally
sort 100000 results, because the first page needs to be calculated, too. As you
go forward in the results, the window gets larger and larger, until it finally
collects all results.
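
To put rough numbers on that growth, here is a small plain-Java sketch of the
arithmetic. It is only a model of the window sizes described above, not a
measurement of Lucene itself. The class name PagingCost, its method names, and
the figure of 1,000,000 total hits are made up for illustration; only the page
size of 50000 comes from this thread.

```java
final class PagingCost {
    /** Number of pages needed to cover totalHits with the given page size. */
    static long pageCount(long totalHits, long pageSize) {
        return (totalHits + pageSize - 1) / pageSize; // ceiling division
    }

    /** Size of the sorted window needed to serve the given 1-based page. */
    static long windowForPage(long page, long pageSize, long totalHits) {
        return Math.min(page * pageSize, totalHits);
    }

    /** Sum of all window sizes when every page is requested once, in order. */
    static long totalWindowEntries(long totalHits, long pageSize) {
        long sum = 0;
        long pages = pageCount(totalHits, pageSize);
        for (long page = 1; page <= pages; page++) {
            sum += windowForPage(page, pageSize, totalHits);
        }
        return sum;
    }

    public static void main(String[] args) {
        long pageSize = 50_000;      // page size mentioned in the thread
        long totalHits = 1_000_000;  // hypothetical, for illustration only
        System.out.println("pages: " + pageCount(totalHits, pageSize));
        System.out.println("window entries with paging: "
                + totalWindowEntries(totalHits, pageSize));
        System.out.println("hits visited when streaming once: " + totalHits);
    }
}
```

In this model the paged total grows roughly with the square of the number of
pages, while a single streaming pass touches each hit once.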

So just getting the results as a stream by implementing the Collector API is
the right way to do this.
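
The shape of such a streaming consumer can be sketched in plain Java. This is
deliberately NOT Lucene's Collector class, only a stand-in for the pattern: a
callback that is invoked once per hit, in arbitrary order, and keeps nothing
but the running statistics. The class ToAddressStats and its methods are
hypothetical. In a real Lucene collector the callback receives a document ID,
and how the «to» value is looked up for that ID is left out here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

final class ToAddressStats {
    // Running count per «to» address; memory grows with the number of
    // distinct addresses, not with the number of hits.
    private final Map<String, Long> counts = new HashMap<>();

    /** Called once per matching document, in whatever order hits arrive. */
    void collect(String toAddress) {
        counts.merge(toAddress, 1L, Long::sum);
    }

    /** The statistics accumulated so far. */
    Map<String, Long> counts() {
        return counts;
    }

    public static void main(String[] args) {
        ToAddressStats stats = new ToAddressStats();
        // Stand-in for the stream of hits; the addresses are invented.
        Stream.of("a@example.org", "b@example.org", "a@example.org")
              .forEach(stats::collect);
        System.out.println(stats.counts());
    }
}
```

Nothing is sorted or buffered, so there is no page window that can grow.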

> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: [email protected]
> >
> >> -----Original Message-----
> >> From: Valentin Popov [mailto:[email protected]]
> >> Sent: Thursday, November 12, 2015 6:48 PM
> >> To: [email protected]
> >> Subject: Re: 500 millions document for loop.
> >>
> >> Toke, thanks!
> >>
> >> We will look at this solution; it looks like this is what we need.
> >>
> >>
> >>> On 12 Nov 2015, at 20:42, Toke Eskildsen <[email protected]> wrote:
> >>>
> >>> Valentin Popov <[email protected]> wrote:
> >>>
> >>>> We have ~10 indexes for 500M documents. Each document
> >>>> has an «archive date» and a «to» address. One of our tasks is to
> >>>> calculate statistics of «to» for the last year. Right now we are
> >>>> using the search archive_date:(current_date - 1 year) and paginating
> >>>> the results at 50k records per page. The bottleneck of that approach is
> >>>> that pagination takes too long: on a powerful server it takes
> >>>> ~20 days to execute, which is far too long.
> >>>
> >>> Lucene does not like deep page requests due to the way the internal
> >>> Priority Queue works. Solr has CursorMark, which should be fairly simple
> >>> to emulate in your Lucene handling code:
> >>>
> >>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>
> >>> - Toke Eskildsen
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [email protected]
> >>> For additional commands, e-mail: [email protected]
> >>>
> >>
> >> Regards,
> >> Valentin Popov
> >>
> 
> 
> Best regards,
> Valentin Popov

